Modeling and Analysis of Data-Driven Systems through Computational Neuroscience Wavelet-Deep Optimized Model for Nonlinear Multicomponent Data Forecasting

Complex time series data exists widely in actual systems, and its forecasting has great practical significance. Simultaneously, the classical linear model cannot obtain satisfactory performance due to nonlinearity and multicomponent characteristics. Based on the data-driven mechanism, this paper proposes a deep learning method coupled with Bayesian optimization based on wavelet decomposition to model the time series data and forecasting its trend. Firstly, the data is decomposed by wavelet transform to reduce the complexity of the time series data. The Gated Recurrent Unit (GRU) network is trained as a submodel for each decomposition component. The hyperparameters of wavelet decomposition and each submodel are optimized with Bayesian sequence model-based optimization (SMBO) to develop the modeling accuracy. Finally, the results of all submodels are added to obtain forecasting results. The PM2.5 data collected by the US Air Quality Monitoring Station is used for experiments. By comparing with other networks, it can be found that the proposed method outperforms well in the multisteps forecasting task for the complex time series.


Introduction
Usually, the data we collect in the existing system is complex time-series data, such as air pollution data [1], i.e., PM2.5, PM10, and O3. e forecasting of these pollution content is essential for air quality control. As to the PM2.5 forecasting problem, accurate multisteps forecasting is more meaningful because it can provide faster response time to control and manage air quality. e data at each moment is the value of the last moment that changes over time and is affected by factors such as weather, industrial production, and people's lives. Due to the multicomponent and nonlinearity of the data, the forecasting research is still an open issue, especially for multistep forecasting. e classical method, probability methods [2], is limited by the prior given knowledge. If the assumed model does not match the actual data distribution, it often fails to provide a correct forecasting result. erefore, mechanism-based modeling is challenging for PM2.5 data.
On the other hand, the data-driven learning method [3] is more adaptable for modeling based on the historical data without requiring prior knowledge. erefore, data-driven learning methods, such as the deep learning method, perform better in nonlinear complex dynamic forecasting tasks. us, in recent years, data-driven modeling methods have shown significant advantages in PM2.5 modeling and forecasting.
In this paper, a data-driven model is proposed to the multisteps ahead forecast. Section 2 discusses the related work of time series modeling and forecasting and analyzes probability and learning method's advantages and disadvantages. en, Section 3 gives the proposed model's details, including the decomposition by wavelet transform, the Gated Recurrent Unit (GRU) as a submodel, and Bayesian optimization for the hyperparameters. As a practical example, the experiment based on the Beijing PM2.5 is conducted to improve the proposed model. e results of 2 cases are shown in Section 4. Finally, the conclusions are discussed in Section 5.

Related Works
Probability methods [24,25], such as ARIMA, dynamic regression model, and the autoregressive threshold model, are quite challenging to get accurate model due to the difficulty of obtaining the prior knowledge required. While learning methods, such as the linear regression forecasting model [26][27][28], can get the hidden relationship between the data through adaptive learning.
With the depth of time series forecasting research, the shallow network based on artificial neural network (ANN) has been used to solve the nonlinear time series forecasting problem [29][30][31]. Ye et al. proposed a self-applicable BP neural network, which established the relationship between the aerosol optical depth and the PM2.5 data [32]. Bai et al. gave a method combined with the autoregressive network and BP network for nonlinear data modeling [33]. However, due to the limitation of the network depth, the network cannot accurately model the complex data for accurate multisteps forecasting.
Recently, the emergence of recurrent neural network (RNN) and its higher accuracy in nonlinear time series forecasting tasks have attracted many researchers' attention. For example, the RNNs are used for the forecasting of PM10 and PM2.5 [34]. However, due to the RNN network structure's limitations, the effect will be worse for multistep forecasting.
e emergence of long short-term memory (LSTM) solves the multisteps dependency problem of RNN [35,36]. Unlike the LSTM, the Gated Recurrent Unit (GRU) further simplifies the composition of the LSTM while maintaining the accuracy of the forecasting [37]. For these deep networks, the hyperparameters determine the performance of the model. However, the hyperparameters selected randomly have resulted in lower performance for modeling.
On the other hand, due to the complex dynamical nonlinearity and multiple components with different frequencies [38], deep learning networks' PM2.5 forecasting performance still needs to be improved, especially multistep forecasting. erefore, the data decomposition is added to the forecasting model, and it turns out that this method can indeed improve the accuracy of forecasting.
As one of the decomposition methods, seasonal trend decomposition procedure based on loess (STL) [39][40][41] can obtain trend, seasonal, and residual components of complex data. Similarly, empirical mode decomposition (EMD) [42][43][44] is also often used to analyze time-series data with higher complexity. EMD decomposes a time series into multiple mode functions (IMF), which reflect the frequency differences of the original data. In our previous research [45], we propose a multistep forecasting model for atmospheric PM2.5 concentration based on EMD decomposition. e obtained IMF components were divided into three groups according to their frequency characteristics. Also, the integrated empirical mode decomposition (EEMD) method is used very frequently. It is an improvement of the EMD method. e modal aliasing problem of the EMD method is solved. Similar to the EMD method, it decomposes the time series into multimodal functions. Nguyen et al. proposed a self-enhancement mechanism based on the EEMD method [46], which decomposes the time series into multiple intrinsic mode functions and divides the intrinsic mode into a strong and a weak correlation part by K-means. ese two parts are used for multitask learning and multiview learning, respectively. Finally, the result is obtained through fusion. e decomposition methods have been used in many areas, such as signal processing and system identification. Many state estimation and parameter identification algorithms have been proposed for linear systems [47][48][49], bilinear systems [50][51][52][53][54][55], and nonlinear systems [56][57][58]; its basic idea is the hierarchical identification principle. ese methods can be used for modeling and prediction of time series. Unlike the above decomposition methods, wavelet decomposition [59] can choose an appropriate mother wavelet function to decompose one-dimensional information into multidimensional information. It can set the number of decomposition layers, which means that the number of components is controllable. Wavelet decomposition has an excellent performance in processing multiscale information and can transform the signal into two parts: low frequency and high frequency. Each frequency is independent of the other. Cheng et al. proposed combining wavelet decomposition with traditional forecasting models (including ANN, ARIMA, and SVM) and proposed three hybrid models for short-term PM2.5 forecasting [60]. Wang et al. proposed a forecasting network combining wavelet decomposition and LSTM network to forecast solar radiation intensity in different weather environments and compare it with traditional and single deep learning networks [61].
In this paper, we propose the model with a wavelet decomposition, the GRUs group (WD-GRU) based on the Bayesian optimization for the hyperparameters, and forecast the multisteps for the Beijing PM2.5 data. Our contributions focus on the following: (1) e proposed model utilizes the hyperparameters optimization of the whole model to improve 2 Computational Intelligence and Neuroscience performance. Sequence model-based optimization (SMBO) is utilized to optimize the hyperparameters, including the number of wavelet layers, the type of mother wavelet function, the number of neurons in the first layer of GRU, epoch, the dropout rate, batchsize, and the type of optimizer. (2) e WD-GRU model is proposed, in which wavelet decomposition is used to decompose the original data to reduce the complexity of the time series data. en each component is forecasted separately by GRU, and the result is finally obtained by fusion. Compared with the model with WD-LSTM [61], the model proposed here improves the forecasting performance for the application of PM2.5.

Deep Model
e model proposed here is a model with a combined structure in which wavelet decomposition is used to reduce the data's nonlinear complexity. GRUs are used for each component to forecast, and the final forecasting will be obtained according to each submodel's results. And the hyperparameters of the whole model are optimized through SMBO. We will describe each part of the proposed model in detail below.

Decomposition of Time Series Data.
In this section, we decompose the time series data into a limited number of low-frequency subsequences and high-frequency subsequences according to time series data characteristics. e discrete wavelet transform (DWT) algorithm is used to achieve the above process. It can be found that the subsequence obtained after decomposition has a more stable variance than the original sequence. It can reduce the complexity of data, which helps increase the forecasting performance of the time series.
In numerical analysis, the DWT is derived from the Fourier transform, while the DWT uses the different basic functions, i.e., not the infinite triangular bases, but the finitelength and decaying wavelet bases are used. DWT need to specify a mother wavelet function η(t), such as "db35"; after translation and amplification of η(t), the corresponding function η k,h (t) is obtained by Moreover, we can calculate the corresponding binary function ψ k,h (t): where k is the scaling factor and k ∈ R; k ≠ 0; h is the translation factor and h ∈ R , and t is the time index. In the DWT process, η k,h (t) and ψ k,h (t) are called wavelet bases. For a time series data M(t), the DWT algorithm can be expressed as where a k,h is the low-frequency component with a scaling factor of k and a translation factor of h and d k,h is the highfrequency component with a scaling factor of k and a translation factor of h. m is the length of the original time series data. n is the number of layers of the wavelet decomposition. So DWT can decompose time series into lowfrequency subsequences and high-frequency subsequences. en a low-pass filter (LPF) and a high-pass filter (HPF) are used to obtain low-frequency subsequence A k,h and highfrequency subsequence D k,h based on a k,h and d k,h . Figure 1 shows the wavelet decomposition process in the actual decomposition task, assuming that M(t) is the time series being decomposed. In the first layer of the wavelet decomposition space, the time series M(t) is decomposed into a low-frequency subsequence A 1 and a high-frequency subsequence D 1 . We have the process of the DWT as where A 1 is the result of a 1,h with the length m through LPF and D 1 is the result of d 1,h with the length m through HPF.
en according to the defined number of decomposition layers, the approximate subsequence will continue to decompose according to the decomposition rules, the lowfrequency subsequence A 1 continues to decompose for A 2 and D 2 , and so on. at is to say, for the time series M(t), after the n layer decomposition, the set of A n , D 1 , D 2 , . . . D n } is finally obtained, and there is a relationship To further analyze wavelet decomposition, we take the 100-day Beijing PM2.5 hourly data from January 1, 2016, as an example to perform wavelet decomposition, and the length of the decomposed discrete sequence is 2,400 hours. e db35 mother wavelet function is used, and the number of decomposition levels is 8. Figure 2(a) shows the average low-frequency results of each layer decomposition, and Figure 2(b) shows the high-frequency components of each layer decomposition. We can see that the low-frequency and high-frequency components have apparent changes as the number of decomposition layers increases, and the lines are gradually flat, which shows that the wavelet decomposition successfully decomposes a complex sequence into several subsequences with a single frequency.
For an actual signal, the number of layers n is determined by its length. A signal with the length m can only be decomposed into log 2 m layers at most. ere is not a standard principle to select the level of decomposition layers and the mother wavelet function. In contrast, they will determine the decomposition result, and further have an effect on the forecasting performance. erefore, we will use Bayesian optimization to determine the type of the mother wavelet function and decomposition layers of our model, which will be discussed in Section 3.3.

Deep Submodel for Wavelet Decomposition Components.
To model each component, we use the GRU network, which is an improvement in LSTM. Each neuron in the network is a processing unit that includes an update gate and a reset gate. e update gate is to replace the previous state information with the current state. e reset gate controls the degree of ignoring the last information status, and the GRU unit has only one timing output.  Figure 1: Schematic of the wavelet decomposition process. 4 Computational Intelligence and Neuroscience e calculation formula in each unit when performing forward propagation according to this structure is as follows [62][63][64]: where σ is the Sigmoid activation function, x i represents the input at the time i, z i is the attenuation coefficient of the updaters, r i is the attenuation coefficient of the reset gate, y i−1 is the output value at the time i − 1, y i is the output state vector at a time i, C z and U z are the weights of the update gate, C r and U r are the weights of the reset gate, C a and U a are the weights of the candidate a i , b z , b r , and b a are offset vectors, and * is an element-wise multiplication. e hidden layer of the GRU network is set to 2 layers, and the activation function is "relu." To prevent the training network from overfitting, we added the dropout in each layer. Figure 3 shows the construction of each submodel, where x i , i � 1, 2, . . . , m is the input and y i , i � 1, 2, . . . , m is the output. e model training uses the L 1 loss function, which can obtain better robustness for forecasting the time series data with noise such as PM 2.5. e L 1 loss function is selected as where θ is the weight of the network, y is the forecasting result, and y is the ground truth. anks to deep learning research, there are many ways to update deep networks' weights based on the loss function, such as Adadelta, Adam, and Sgd. And the other hyperparameters, such as dropout rate, batch-size, and the number of epochs, will also affect the capability of deep learning networks. To guarantee performance, we will use the Bayesian SMBO method to select these hyperparameters. Not only can WD-GRU be used for air quality monitoring research but it also forms a new network by combining with other networks, which can be used in other research fields, such as the research on prediction and management control of water environment [65][66][67] and IoT intelligence [68].

Bayesian Sequence Model-Based Optimization (SMBO).
Hyperparameters are one of the keys to deep learning models, directly determining the performance of the model. Due to the deepening of the forecasting model network, the selection of hyperparameters becomes a difficult problem. But, the traditional method of selecting parameters is inefficient. It cannot be used at all when there are too many hyperparameters, so the chosen hyperparameters are also challenging to keep the model perform well. Here, we use the Bayesian SMBO algorithm [69,70] to optimize the hyperparameters, including the hyperparameters of the deep learning model and wavelet decomposition as the number of decomposition layers.
For SMBO, the key is to give an optimization objective function. In the parameter space, the Gaussian process is used to update the posterior distribution of the objective function to seek a group parameter that maximizes the objective function.
e RMSE is used as the objective function: where m is the length of the input series, y i (w) is the forecasting result by the hyperparameters w, and y i is the ground truth. e objective function of SMBO is minimized as where w * is the optimal parameter determined by SMBO; w is a set of input hyperparameters, including not only the weight of the network θ, but also the mother wavelet functions and the level of decomposition layers. W is the multidimensional hyperparameters space defined for the optimized model. e SMBO algorithm can generally be divided into two processes: Gaussian process and hyperparameter selection. In the Gaussian process, the modeling and fitting optimization of the objective function is achieved, and the posterior distribution corresponding to the input w is obtained; in the hyperparameter selection process, the optimal hyperparameters are explored at the minimum cost. According to the objective function g(w), we set the Gaussian distribution as follows: where μ(w) is the average value of g(w) and O(w, w ′ ) is the covariance matrix of g(w). e initial O(w, w ′ ) can be expressed as

Computational Intelligence and Neuroscience
In the process of SMBO searching for optimal parameters, the covariance matrix of the above Gaussian process will continuously change during the iterative process. Assuming that the set of parameters entered in step i + 1 is w i+1 , then the covariance matrix is where . en we get the posterior probability of g(w) as where D is the observation data, μ i+1 (w) is the mean value of g(w) at step i + 1, and σ 2 i+1 (w) is the variance of g(w) at step i + 1.
After obtaining the posterior probability, the next step is to find the optimal parameters through hyperparameter selection. is search method is complicated and takes a lot of time, so we use the following upper confidence bound (UCB) acquisition function to develop the calculation effectiveness: where ζ i+1 is a constant, H(w|D i ) is the UCB acquisition function, and w i+1 is the selected hyperparameter of step i + 1. e SMBO algorithm of the network is shown in Algorithm 1.

Model Framework with the Optimization of
Hyperparameters. Based on the details introduced in Section 3.1-Section 3.3, Figure 4 shows the proposed deep forecasting model. Firstly, the original time series is decomposed based on wavelet decomposition to obtain the corresponding low-frequency subsequences A n and highfrequency subsequences D 1 , D 2 , . . . D n , and then GRU is trained to learn each component of dynamic characteristics. e trained GRU is then used to separately forecast the subsequences obtained by decomposition and finally achieve the forecasting.
During the model's training, the SMBO algorithm optimizes hyperparameters based on the forecasting result and the expected output. Once the optimized parameters have been obtained, the Bayesian optimization process will stop. en the whole model is applied to the forecast.

Dataset and Experimental
Setup. e PM2.5 dataset of the US State Department [71] is used to verify the proposed model's effect, including the average PM2.5 concentration per hour in Beijing's atmosphere from 2013 to 2017, totaling 37,704 hours. e unit of the data is μg/m 3 . We use PM2.5 data to train our proposed model and other comparative models. e learning step is set to 24; that is, the model function is to use the data of 24 hours of the previous day to forecast the value of 24 hours of the next day. e forecast hourly of one day in advance is of great significance, which can help people understand the PM2.5 situation of the next day and plan the next day according to the numerical response's weather conditions.
It is often more reasonable to have enough performance evaluation indicators in the experimental verification stage. We use 5 indicators to assess the performance of our models, including root means square error (RMSE), normalized mean square error (NRMSE), mean absolute error (MAE), symmetric mean absolute percentage error (SMAPE), and Pearson correlation coefficient (R). e smaller the first four indicators are, the more accurate the forecasting is. R represents the Pearson correlation coefficient; the larger the value is, the closer the fitted relation between the ground truth and the forecasted value is. e calculation methods of 5 indicators are as follows: where m is the length of datasets, y is the ground truth, y is the forecasting result, max(y) is the maximum of y, min(y) is the minimum of y, y is the average of the ground truth, and y represents the average of a forecasted value. Our experiment was conducted for the experimental platform using a PC server under Windows 10 operating system. e CPU is Intel (R) i5-6200U CPU, the single-core operating frequency is 2.30 GHz, and the RAM is 8 GB. Use Python 3.7.3 and Keras library to build the WD-GRU forecasting model, making the program more concise.

Case 1: Hyperparameter Selection Based on Bayesian Optimization.
is case is based on the data set mentioned in Section 4.1, and the PM2.5 content per hour in Beijing from March 22, 2016, to April 9, 2016. e hourly content is forecasted, and the forecast period is 24 hours. In this case, we evaluate the hyperparameters of the WD-GRU forecasting model optimized by the Bayesian SMBO algorithm. To verify the SMBO algorithm's effectiveness in determining the number of wavelet decomposition layers and analyze the effect of decomposition layers on the model performance, we compare it with the traditional random search (random search) hyperparameter selection method. 6 Computational Intelligence and Neuroscience Firstly, we define a multidimensional hyperparameter space for the WD-GRU model. Table 1 shows the multidimensional hyperparameter space. e selected hyperparameters include decomposition layers, mother wavelet function, the number of neurons in the first layer, batch-size, epochs, optimizer, and dropout rate. en we optimize the overall RMSE of the forecasting model; after 100 epochs, Bayesian optimization gives a set of optimal hyperparameters. Table 2 shows the selected parameter set by the SMBO algorithm from the search space and the parameters chosen by the random search method in the common deep learning toolbox. We can find that there is obvious difference between the two sets of parameters.
We can note that the number of wavelet decomposition layers selected by SMBO is 8. And the wavelet function given by SMBO is db35. To verify it is reasonable, in this case, we conduct experiments on the effect of WD of different layers with the proposed model. is experiment uses the db35 mother wavelet function to decompose the PM2.5 sequence and then uses the two-layer GRU submodel mentioned in Section 3.2 for model training. e other hyperparameters of the submodel use SMBO parameters in Table 2. en we test and verify the previously defined test set. e specific settings of the decomposition layers for the test model are as follows: Input: the dataset D, the RMSE of the proposed model g(w), the hyperparameter space W, the UCB acquisition function H(w|D i ), the number of parameter selections is T, the number of decomposition components l. Output: Returns the optimal hyperparameter w * .
(5) Use the UCB acquisition function for parameter update w * ← arg max H(w|D (l) i ) (6) Use the w * hyperparameter to train the proposed network to obtainy i ← g(w * ), and calculate and update g (l) min .
i+1 ) (8) end for (9) w * � arg min  Computational Intelligence and Neuroscience (9) Mode no. 9: perform 9 layers of WD and train 10 GRUs for A 9 , D 1 -D 9 , respectively (10) Mode no. 10: perform 10 layers of WD and train 11 GRUs for A 10, D 1 -D 10 , respectively Table 3 shows the forecasting results under different decomposition layers, where red is the best value, and the training of mode no. 8 uses the hyperparameters determined by the SMBO algorithm. We found that as the number of decomposition layers increases, these five indicators show an overall optimization trend. When the number of decomposition layers is set to 6, the value of RMSE has decreased from 48.5712 μg/m 3 to 22.0185 μg/m 3 . Mode no. 8 obtains the least MAE and NRMSE as 16.2063 and 0.0682 and is very close to the optimal value of RMSE, SMAPE, and R.
Based on the above experimental results, we conclude that the number of wavelet decomposition layers determined by SMBO is the optimal solution in the hyperparameter space. Simultaneously, we find that the more layers the decomposition performs, the better the final model's forecasting effect is. When the number of decomposition layers reaches a specific value, the model's performance will no longer improve. If we continue to increase the number of decomposition layers, it will cause the model's overall performance to decline, for example, the mode no.10 with 10 levels. We analyzed this phenomenon and found that when too many decomposition layers are defined for data, false frequencies appear in the decomposition results. ese are not the original signal's information, and this information leads to the deterioration of the forecasting results.
After learning the feasibility of the SMBO, to further explore the advantages of the SMBO algorithm, we then use the two sets of hyperparameters in Table 2 to train the WD-GRU model and conduct the test experiment. Table 4 shows the performance indicators of the two models. We find that the model trained using the hyperparameters determined by the SMBO algorithm is significantly better. RMSE, MAE, NRMSE, SMAPE, and R increased by 4.9129 μg/m 3 , 2.5653 μg/m 3 , 0.0206 μg/m 3 , 0.0349, and 0.035, RMSE reached 21.7300 μg/m 3 , the R was higher than 0.9.
In summary, the SMBO algorithm is useful for selecting the hyperparameters of the proposed model. We verified its feasibility in the experiment of decomposing layers. And through comparing the model trained with the random search hyperparameter method, it is verified that the hyperparameter set determined by the SMBO algorithm can make the proposed model obtain a better forecast effect.

Case 2: Forecasting Performance Verification.
To verify the WD-GRU model's performance advantages, we choose five combination models of decomposition method and deep learning methods to compare with the models proposed in this case. e comparison models used include decomposition-ARIMA-GRU-GRU [38], EMD_RNN [43] (EMD based on GRU), EMDCNN_GRU [45] (EMD and CNNbased on GRU), WD-RNN [34], and WD-LSTM [61]. Figure 5 shows the forecasting trend curves of these six models. We use the red curve to represent the WD-GRU model proposed here. We can see that the WD-GRU model is closest to the ground truth, the forecasting trend curve follows the original data as a whole, and only a certain deviation occurs in some places where the trend jump is large. Table 5 gives the five evaluation indicators; the red value in the table is the optimal value of each indicator. Figures 6  and 7 show various indicators in the form of a histogram.     , EMD-RNN [43], EMDCNN_GRU [45], WD-RNN [34], WD-LSTM [61], and the proposed method. e structure of WD-RNN [34], WD-LSTM [61], and the proposed WD-GRU model in Table 5 differ only in the selection of submodels. However, we see that the GRU network's proposed model as a submodel performs better in various indicators. Compared with the WD-LSTM [61] model, the RMSE of the proposed model is reduced by 4.7035 μg/m 3 . R increased from 0.8932 to 0.9276. Similarly, in the EMD-RNN [43] and EMDCNN_GRU [45] models, the effect of the model using the GRU network is also much better.
selection of hyperparameters during training play a decisive role in the performance of the resulting model. We use SMBO to determine the hyperparameters of the proposed model. ese hyperparameters can develop the performance of the model effectively. e data in Table 5 shows that the WD-LSTM [61] model does not use SMBO to determine the hyperparameters. Its NRMSE is 0.0935μg/m 3 . e NRMSE of the proposed model is 0.0682μg/m 3 . erefore, we can conclude that the degree of improvement is due to the replacement of the submodel with the GRU model and the SMBO method's credit.
In summary, our proposed WD-GRU model has a reasonably good effect on the multisteps forecasting task of PM2.5 concentration in the atmosphere per hour with a period of 24 hours.

Conclusions
is paper proposes a model combining wavelet decomposition and GRU network, in which wavelet decomposition is used to put down the complexity of the series time data.
en the GRUs are used to obtain component forecasting separately and finally achieve results through fusion. e Bayesian optimization is used to optimize each submodel's hyperparameters, wavelet decomposition layers, and mother wavelet function.
Experiments have confirmed that, in the multisteps forecasting of PM2.5 with 24 hours ahead, the model has an excellent performance. It is worth noting that the model we proposed is applicable not only in PM2.5 sequences but also in many similar data-driven forecasting tasks, such as temperature and humidity forecasting.

Data Availability
Data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
e authors declare no conflicts of interest.