Initialization by a Novel Clustering for Wavelet Neural Network as Time Series Predictor

The architecture and parameter initialization of the wavelet neural network (WNN) are discussed and a novel initialization method is proposed. The new approach can be regarded as a dynamic clustering procedure that derives the number of neurons as well as the initial values of the translation and dilation parameters from the input patterns and the activating wavelet functions. Three simulation examples are given to compare the performance of our method with that of Zhang's heuristic initialization approach. The results show that the new approach not only decides the WNN structure automatically but also provides superior initial parameter values that make the optimization process more stable and faster.


Introduction
An artificial neural network (ANN) is a highly parallel distributed network of connected processing units called neurons. Due to their fascinating characteristics of robustness, fault tolerance, adaptive learning ability, and massive parallel processing capabilities, ANNs possess the capability of learning from examples with both linear and nonlinear relationships between the input and output signals, which makes them a popular tool for time series prediction [1,2], feature extraction [3,4], pattern recognition [5,6], and classification [7,8]. However, ANNs have limited ability to characterize local features, such as discontinuities in curvature, jumps in value, or other edges.
Instead of using common sigmoid activation functions, the wavelet neural network (WNN) employs nonlinear wavelet basis functions [9,10], which are localized in both time and frequency, and has been developed as an alternative approach to nonlinear fitting problems. It has been proven that families of wavelet frames are universal approximators [11], which gives a theoretical basis to their use in function approximation and process modeling.
There are two different WNN architectures. One type has fixed wavelet bases with fixed dilation and translation parameters (WNN-Type1); in this type, only the output layer weights are adjustable. The other type has variable wavelet bases whose dilation and translation parameters, as well as the output layer weights, are adjustable (WNN-Type2). Several WNN models have been proposed in the literature. In [12], a four-layer self-constructing wavelet network (SCWN) controller for nonlinear systems control is described, and orthogonal wavelet functions are adopted as its node functions. In [13], a local linear wavelet neural network (LLWNN) is presented, in which the connection weights between the hidden layer and output layer of the conventional WNN are replaced by a local linear model. In [14], a model of multiwavelet-based neural networks is proposed. The structure of this network is similar to that of the wavelet network, except that the orthonormal scaling functions are replaced by orthonormal multiscaling functions.
A time series is a sequence of observations taken sequentially in time [15]. Time series prediction is an important research and application area. Much effort has been devoted over the past several decades to the development and improvement of time series prediction models. Besides the well-known linear models such as moving average, exponential smoothing, and the autoregressive integrated moving average, nonlinear models including artificial neural network, wavelet neural network, and fuzzy system models have also become well-established time series models. In this paper, the wavelet neural network (WNN) is used as the time series predictor, and the detailed research work is described subsequently.
We adopt WNN-Type2, with adjustable translation and dilation parameters and the multiplication form of multidimensional wavelets, as the nonlinear model for time series prediction in this paper. Key problems in designing this type of WNN are determining the WNN architecture, initializing the translation and dilation vectors, and choosing a learning algorithm that can be used effectively for training the WNN. This study mainly focuses on the first two points. In practical applications, the number of hidden neurons, which determines the structure of the network, is often set by experience or by time-consuming trial-and-error tests, and the initial values of parameters are often set randomly. Because wavelet functions vanish rapidly, random initialization of the dilation and translation parameters may place the wavelets' effective response regions outside the domain of interest, which makes the learning performance very unstable. It is therefore inadvisable to adopt a random initialization scheme for dilations and translations in WNN. In [9], Zhang proposes a heuristic initialization procedure which considers the domain of interest of the input patterns. However, its implementation does not consider the wavelet functions used in the WNN, and the resolution is reduced gradually according to an established rule which does not take full account of the sample distribution.
In the present paper, inspired by the localization property of wavelet functions and considering the multiplication form of multidimensional wavelets in the hidden neuron for multivariable inputs, we present a novel initialization approach for WNN with the help of a new clustering method. This approach determines the number of hidden units and initializes the translation and dilation vectors simultaneously. After training by the gradient descent method, we can see that, besides being able to determine the number of neurons, WNN with our initialization method gives more satisfactory and stable results for time series prediction than Zhang's heuristic initialization method, which is used for this model in the literature [9,16,17].
The paper is organized as follows. A brief review of wavelet and wavelet-based function approximation is given in Section 2, followed by the introduction of the architecture of wavelet neural network in Section 3. The detailed description of the clustering based initialization approach and the training algorithm are given in Sections 4 and 5. Three simulation experiments on time series prediction problems and the comparison results with Zhang's heuristic initialization method are presented in Section 6. Finally, some conclusions are drawn in the last section.

Wavelet-Based Function Approximation
Wavelets in the following form,

$$\psi_{a,b}(t) = |a|^{-1/2}\,\psi\!\left(\frac{t-b}{a}\right), \quad a, b \in \mathbb{R},\ a \neq 0, \tag{1}$$

are a family of functions generated from one single function $\psi(t)$ by the operations of dilation and translation. $\psi(t) \in L^2(\mathbb{R})$ is called a mother wavelet function and satisfies the admissibility condition

$$C_\psi = \int_{-\infty}^{+\infty} \frac{|\hat{\psi}(\omega)|^2}{|\omega|}\,d\omega < +\infty, \tag{2}$$

where $\hat{\psi}(\omega)$ is the Fourier transform of $\psi(t)$ [11,18]. Grossmann and Morlet [19] proved that any function $f(t)$ in $L^2(\mathbb{R})$ can be represented by

$$f(t) = \frac{1}{C_\psi}\iint W_f(a,b)\,\psi_{a,b}(t)\,\frac{da\,db}{a^2}, \tag{3}$$

where $W_f(a,b)$, given by

$$W_f(a,b) = \int_{-\infty}^{+\infty} f(t)\,\psi_{a,b}(t)\,dt, \tag{4}$$

is the continuous wavelet transform of $f(t)$. Compared with the conventional Fourier transform, the wavelet transform (WT) in its continuous form provides a flexible time-frequency window, which narrows when observing high frequency phenomena and widens when analyzing low frequency behavior. Thus, time resolution becomes arbitrarily good at high frequencies, while frequency resolution becomes arbitrarily good at low frequencies. This kind of analysis is suitable for signals composed of high frequency components with short duration and low frequency components with long duration, which is often the case in practical situations.
As the parameters $a$ and $b$ take continuous values, the resulting continuous wavelet transform (CWT) is a highly redundant representation, and this redundancy makes it impracticable. Therefore, the scale and shift parameters are evaluated on a discrete time-scale grid with fixed grid constants $a_0 > 1$ and $b_0 > 0$, leading to a discrete set of continuous wavelet functions:

$$\psi_{m,n}(t) = a_0^{-m/2}\,\psi\!\left(a_0^{-m}t - n b_0\right), \quad m, n \in \mathbb{Z}. \tag{5}$$

The continuous inverse wavelet transform (3) is discretized as

$$f(t) = \sum_{m}\sum_{n} c_{m,n}\,\psi_{m,n}(t). \tag{6}$$

If there exist two constants $A > 0$ and $B < +\infty$ such that, for any $f(t)$ in $L^2(\mathbb{R})$, the following inequalities hold:

$$A\,\|f\|^2 \le \sum_{m,n} \left|\langle f, \psi_{m,n}\rangle\right|^2 \le B\,\|f\|^2, \tag{7}$$

where $\|f\|$ denotes the norm of function $f(t)$ and $\langle f, g\rangle$ denotes the inner product of functions $f$ and $g$, then the family $\{\psi_{m,n}\}$ is said to be a frame of $L^2(\mathbb{R})$. It has been proved that families of wavelet frames of $L^2(\mathbb{R})$ are universal approximators. Inspired by the wavelet decomposition of $f(t) \in L^2(\mathbb{R})$ in (6) and a single hidden layer network model, Zhang and Benveniste [9] developed a new neural network model, namely, the wavelet neural network (WNN).

Architecture of Wavelet Neural Network
A brief review of wavelet decomposition theory has been given in Section 2, where only univariate functions were considered. For the modeling of multivariable processes, multidimensional wavelets must be defined. In the present work, multidimensional wavelets are defined as the product of single-dimensional wavelet functions:

$$\Psi(\mathbf{x}) = \prod_{i=1}^{n} \psi\!\left(\frac{x_i - b_i}{a_i}\right), \tag{8}$$

where $\mathbf{x} = (x_1, x_2, \ldots, x_n)^T$ is the input vector and $\mathbf{b} = (b_1, b_2, \ldots, b_n)$ and $\mathbf{a} = (a_1, a_2, \ldots, a_n)$ are the translation and dilation vectors, respectively.
Generalized from the radial basis function neural network, the WNN is in fact a feed-forward neural network with one hidden layer, wavelet functions as activation functions in the hidden nodes, and a linear output layer. As a result, the network output $\mathbf{y} = (y_1, y_2, \ldots, y_m)^T$ is computed as

$$y_k = \sum_{j=1}^{N} w_{jk}\,\Psi_j(\mathbf{x}) + \bar{y}_k, \quad k = 1, 2, \ldots, m, \tag{9}$$

where $\Psi_j$ is the multidimensional wavelet of the $j$th hidden unit, and $\mathbf{w} = (w_{jk})$ and $\bar{\mathbf{y}} = (\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_m)$ define the connecting weights and the bias terms between the hidden layer and the output layer, respectively. $N$ is the number of units in the hidden layer. These wavelet neurons are usually referred to as wavelons. The architecture of a WNN is illustrated in Figure 1.
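The forward computation described above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names, the array layout (one row of `a` and `b` per wavelon), and the use of the Mexican Hat wavelet that the paper adopts later are assumptions made here, not the authors' implementation.

```python
import numpy as np

def mexican_hat(t):
    """Mexican Hat mother wavelet, (1 - t^2) * exp(-t^2 / 2)."""
    return (1.0 - t**2) * np.exp(-0.5 * t**2)

def wnn_forward(X, a, b, w, y_bar):
    """Forward pass of a product-form wavelet network (hypothetical helper).

    X     : (P, n) input patterns
    a, b  : (N, n) dilation and translation vectors, one row per wavelon
    w     : (N, m) hidden-to-output weights
    y_bar : (m,)   output bias terms
    returns (P, m) network outputs
    """
    # z[p, j, i] = (x_pi - b_ji) / a_ji
    z = (X[:, None, :] - b[None, :, :]) / a[None, :, :]
    # Multidimensional wavelet: product of 1-D wavelets over the input dimensions
    Psi = np.prod(mexican_hat(z), axis=2)      # shape (P, N)
    # Linear output layer
    return Psi @ w + y_bar

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(5, 4))             # 5 patterns, 4 inputs
a = np.ones((3, 4))                            # 3 wavelons
b = rng.uniform(0, 1, size=(3, 4))
w = rng.uniform(-1, 1, size=(3, 2))            # 2 outputs
y_bar = np.zeros(2)
Y = wnn_forward(X, a, b, w, y_bar)
print(Y.shape)
```

Broadcasting evaluates all wavelons for all patterns at once; a pattern that coincides with a wavelon's translation vector gives that wavelon its peak response of 1.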

Initialization Approach of Wavelet Neural Network
Before training the WNN, some factors should be determined in advance, namely, the number of wavelons $N$ and the initial values of the parameters ($w_{jk}$, $\bar{y}_k$, $a_{ij}$, and $b_{ij}$). The former is fixed once the structure of the network is determined, while the latter are adjusted by the training algorithm. All these factors are crucial for the performance of the network in simulating the real model. In this section, a brief description of the wavelet window is presented first, and then a novel initialization method based on dynamic clustering is proposed, which provides the number of hidden neurons and the initial values of the translation and dilation parameters at the same time.

Wavelet Window in Time Domain.
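A mother wavelet function $\psi(t)$ defined by (2) will have sufficient decay, which can be considered as a "local response." In other words, $\psi(t)$ is a window with center $t^*$ and radius $\Delta_\psi$ in the time domain, which can be computed by [20]

$$t^* = \frac{1}{\|\psi\|^2}\int_{-\infty}^{+\infty} t\,|\psi(t)|^2\,dt, \qquad \Delta_\psi = \frac{1}{\|\psi\|}\left[\int_{-\infty}^{+\infty} (t - t^*)^2\,|\psi(t)|^2\,dt\right]^{1/2}. \tag{10}$$

As a result, its translated and dilated version $\psi_{a,b}(t)$, which is proportional to $\psi((t-b)/a)$, will be concentrated in the region $[b + a t^* - a\Delta_\psi,\ b + a t^* + a\Delta_\psi]$ in the time domain. In this paper, the Mexican Hat wavelet function with symmetric graph (Figure 2) is employed, which is given by the following equation:

$$\psi(t) = (1 - t^2)\,e^{-t^2/2}. \tag{11}$$

From (10), the center and radius of the Mexican Hat wavelet window in the time domain can be derived as

$$t^* = 0, \qquad \Delta_\psi = \sqrt{7/6} \approx 1.0801. \tag{12}$$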
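As a quick numerical check of the window center and radius, the two integrals can be approximated on a fine grid. The sketch below assumes the unnormalized Mexican Hat $(1 - t^2)e^{-t^2/2}$ and a plain Riemann sum; the helper name and the integration range are choices made here for illustration.

```python
import numpy as np

def mexican_hat(t):
    """Mexican Hat mother wavelet, (1 - t^2) * exp(-t^2 / 2)."""
    return (1.0 - t**2) * np.exp(-0.5 * t**2)

def window_center_radius(psi, lo=-20.0, hi=20.0, num=400_001):
    """Approximate the time-domain window center and radius of psi (hypothetical helper)."""
    t = np.linspace(lo, hi, num)
    dt = t[1] - t[0]
    p = psi(t) ** 2                               # |psi(t)|^2
    norm_sq = np.sum(p) * dt                      # ||psi||^2
    center = np.sum(t * p) * dt / norm_sq
    radius = np.sqrt(np.sum((t - center) ** 2 * p) * dt / norm_sq)
    return center, radius

center, radius = window_center_radius(mexican_hat)
print(center, radius)   # center is numerically zero, radius is about 1.0801 = sqrt(7/6)
```

Because the radius is a ratio of two integrals of $|\psi|^2$, any constant normalization factor in front of the wavelet cancels and does not change the result.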

Initialization by a Novel Clustering Approach for WNN.
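The structure of our network is illustrated in Figure 1. The vector $\mathbf{th} = (\mathrm{th}_1, \mathrm{th}_2, \ldots, \mathrm{th}_n)^T$ is a threshold vector which is set in advance of executing the algorithm. The cluster mean $\mathbf{m}_k$ will be reset as

$$\mathbf{m}_k = \frac{1}{|C_k|}\sum_{\mathbf{x} \in C_k} \mathbf{x}, \tag{13}$$

and the dimensional radius $\mathbf{r}_k = (r_{k1}, r_{k2}, \ldots, r_{kn})$ will be reset as [equation (14) is missing from the source text], where $|C_k|$ is the cardinal number of $C_k$ and $\mathbf{x} = (x_1, x_2, \ldots, x_n)^T$ are the patterns that belong to $C_k$.
Else, the number of clusters becomes $K = K + 1$; create the $K$th cluster $C_K = \{\mathbf{x}_p\}$ with cluster mean $\mathbf{m}_K = \mathbf{x}_p$ and dimensional radius $\mathbf{r}_K = \mathbf{0}$, where $\mathbf{x}_p$ is the current input pattern.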
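A minimal sketch of a one-pass dynamic clustering of this kind is given below. Since the step-by-step procedure and the radius formula are not fully legible in the text, three things are assumptions made here rather than the authors' rules: the candidate cluster is the one with the nearest mean, a pattern joins it only if every coordinate lies within its threshold, and the dimensional radius is the largest per-dimension deviation from the cluster mean.

```python
import numpy as np

def dynamic_clustering(X, th):
    """One-pass clustering of the rows of X under a per-dimension threshold th."""
    members = [[0]]                        # pattern indices of each cluster
    means = [X[0].astype(float)]           # cluster means
    radii = [np.zeros(X.shape[1])]         # dimensional radii
    for p in range(1, len(X)):
        x = X[p]
        # ASSUMPTION: candidate cluster = nearest mean in Euclidean distance
        k = int(np.argmin([np.linalg.norm(x - m) for m in means]))
        # ASSUMPTION: join only if every coordinate is within its threshold
        if np.all(np.abs(x - means[k]) <= th):
            members[k].append(p)
            pts = X[members[k]]
            means[k] = pts.mean(axis=0)    # reset the cluster mean
            # ASSUMPTION: radius = largest per-dimension deviation from the mean
            radii[k] = np.abs(pts - means[k]).max(axis=0)
        else:
            # Otherwise open a new single-element cluster with zero radius
            members.append([p])
            means.append(x.astype(float))
            radii.append(np.zeros(X.shape[1]))
    return members, np.array(means), np.array(radii)

X = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 4.9], [0.05, 0.0]])
th = np.array([1.0, 1.0])
members, means, radii = dynamic_clustering(X, th)
print(members)   # [[0, 1, 4], [2, 3]]
```

The number of clusters is not fixed beforehand; it grows whenever a pattern falls outside the thresholds of its nearest cluster, which is how the procedure yields the number of wavelons.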

Remark 1.
(i) Vector $\mathbf{th} = (\mathrm{th}_1, \mathrm{th}_2, \ldots, \mathrm{th}_n)^T$ in the above procedure is crucial to the clustering result. Large elements of $\mathbf{th}$ lead to a coarse partition, namely, a small $K$, whereas small elements lead to a large $K$. In practice, a reasonable $\mathbf{th}$ should be determined by the input patterns. In our experiments, we prefer to adopt the vector $\mathbf{th}$ as in formula (15) to control the cluster scale, which offers moderate results in most cases. Consider

$$\mathrm{th}_i = \sqrt{\operatorname{var}\!\left(\mathbf{x}^{(i)}\right)}, \quad i = 1, 2, \ldots, n, \tag{15}$$

where $\mathbf{x}^{(i)} = (x_{1i}, x_{2i}, \ldots, x_{Pi})$ collects the $i$th component of all $P$ input patterns.
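(iii) After the clustering procedure of (1)-(4), the corresponding results help us to determine the number of wavelons in WNN as $N = K$ and the initial values of the translation and dilation vectors as

$$\mathbf{b}_j = \mathbf{m}_j, \qquad \mathbf{a}_j = \frac{\lambda}{\Delta_\psi}\,\mathbf{r}_j, \quad j = 1, 2, \ldots, N. \tag{17}$$

In (17), $\lambda$ is a relaxation parameter which satisfies $\lambda \ge 1$; $\Delta_\psi$ is the window radius of the wavelet function $\psi(t)$.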
(iv) In order to avoid zero dilation parameters, the radius vector of a cluster with a single element should be redefined. The minimum value strategy is employed, which can be described as $r_{ki} = \min_{l}(r_{li})$, $(|C_k| = 1, |C_l| > 1)$.
(v) The connecting weights $\mathbf{w} = (w_{jk})$ between the hidden layer and the output layer are randomly initialized in the region $[-1, 1]$, and the bias term $\bar{\mathbf{y}} = (\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_m)$ is initialized as the mean vector of input patterns.
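The remarks above can be turned into a short initialization sketch. It assumes that cluster means, dimensional radii, and cluster sizes are already available, and it reads the initialization rule as "translation = cluster mean, dilation such that the wavelet window radius equals the relaxed cluster radius"; that reading, the population variance used for the threshold, and all function names are assumptions made here, not confirmed details of the paper.

```python
import numpy as np

DELTA_PSI = np.sqrt(7.0 / 6.0)   # window radius of the Mexican Hat wavelet, about 1.0801

def threshold_vector(X):
    """Per-dimension threshold: square root of the variance of each input component.

    ASSUMPTION: population variance (ddof=0); the text does not say which is used.
    """
    return np.sqrt(np.var(X, axis=0))

def init_from_clusters(means, radii, sizes, lam=2.5, delta=DELTA_PSI):
    """Initial dilation (a) and translation (b) vectors, one row per wavelon."""
    radii = np.array(radii, dtype=float)
    sizes = np.asarray(sizes)
    single = sizes == 1
    if single.any() and (~single).any():
        # Minimum value strategy: single-element clusters take, per dimension,
        # the smallest radius found among the multi-element clusters
        radii[single] = radii[~single].min(axis=0)
    b = np.array(means, dtype=float)    # ASSUMPTION: translation = cluster mean
    a = lam * radii / delta             # ASSUMPTION: a * delta = lam * radius
    return a, b

means = [[0.0, 0.0], [5.0, 5.0], [9.0, 9.0]]
radii = [[0.2, 0.4], [0.1, 0.6], [0.0, 0.0]]
sizes = [3, 2, 1]
a, b = init_from_clusters(means, radii, sizes)
print(a)
print(b)
```

The default `lam=2.5` mirrors the relaxation parameter value reported for the simulations. A multi-element cluster whose points share one coordinate would still produce a zero radius in that dimension; this sketch does not handle that case.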

Training Algorithm
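The gradient descent method is used to train the WNN in this paper. Parameters ($w_{jk}$, $\bar{y}_k$, $a_{ij}$, $b_{ij}$) are adjusted in the opposite direction of the gradient so that the objective function (18) of the model is minimized. Consider

$$E = \frac{1}{2}\sum_{p=1}^{P} \left\|\mathbf{y}_p - \mathbf{f}_p\right\|^2, \tag{18}$$

where $\mathbf{y}_p$ is the output of the network and $\mathbf{f}_p$ is the desired output for the $p$th pattern.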
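The corrections applied to parameters $w_{jk}$, $\bar{y}_k$, $a_{ij}$, and $b_{ij}$ are given by (19). [Equation (19), the parameter update formulas and their accompanying definitions, is missing from the source text.]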
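For orientation, the gradients needed by such an update can be derived directly from the network definition. The derivation below is a sketch under stated assumptions: a squared-error objective $E = \tfrac{1}{2}\sum_p \|\mathbf{y}_p - \mathbf{f}_p\|^2$, product-form wavelons with the Mexican Hat wavelet, plain gradient descent without momentum, and a learning rate written as $\eta$ (a symbol introduced here). It is not necessarily the exact form of the authors' update equations. All expressions are per pattern; the full gradient is the sum over patterns.

```latex
% Per-pattern shorthand (pattern index p omitted):
%   z_{ij} = (x_i - b_{ij}) / a_{ij},   \Psi_j = \prod_{i} \psi(z_{ij}),   e_k = y_k - f_k
\begin{align*}
\frac{\partial E}{\partial w_{jk}} &= e_k\,\Psi_j, \qquad
\frac{\partial E}{\partial \bar{y}_k} = e_k, \\
\frac{\partial E}{\partial b_{ij}} &= -\frac{1}{a_{ij}}
  \Big(\sum_{k} e_k\,w_{jk}\Big)\,\psi'(z_{ij}) \prod_{l \neq i} \psi(z_{lj}), \\
\frac{\partial E}{\partial a_{ij}} &= z_{ij}\,\frac{\partial E}{\partial b_{ij}}, \\
\psi'(t) &= (t^3 - 3t)\,e^{-t^2/2} \quad \text{(Mexican Hat)}, \\
\theta &\leftarrow \theta - \eta\,\frac{\partial E}{\partial \theta},
  \qquad \theta \in \{w_{jk},\ \bar{y}_k,\ a_{ij},\ b_{ij}\}.
\end{align*}
```

The relation between the dilation and translation gradients follows from $\partial z_{ij}/\partial a_{ij} = z_{ij}\,\partial z_{ij}/\partial b_{ij}$, so both can be computed from the same intermediate products.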

Simulation Examples
In this section, WNN model with two different initialization schemes is applied to three time series prediction problems, namely, the prediction of Mackey-Glass, Box-Jenkins, and traffic volume time series. The performance of WNN with the clustering based initialization approach (WNN-CIA) described in Section 4 is compared to Zhang's heuristic initialization approach (WNN-HIA) in each simulation.
Because the architecture of WNN-HIA must be decided in advance, we adopt for it the same architecture as WNN-CIA in the experiments so that the two can be compared directly. The relaxation parameter $\lambda$ in (17) of WNN-CIA is set to 2.5 in all simulations, and the Mexican Hat function defined in (11) is employed as the wavelet function in the hidden neurons of all models. The root mean square error (RMSE) of the training/testing set, given by (20), is used as the index for comparing the performances of WNN with different initialization schemes. Consider

$$\mathrm{RMSE} = \sqrt{\frac{1}{P}\sum_{p=1}^{P} (y_p - f_p)^2}, \tag{20}$$

where $P$ is the number of samples in the set, $y_p$ is the network output, and $f_p$ is the desired output.

Table 2 (fragment recovered from the source), RMSE for testing set: … [22], 0.0907; genetic algorithm and fuzzy system [23], 0.049.
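The RMSE index used for the comparisons is the usual root of the mean squared prediction error. A minimal sketch, with a function name chosen here for illustration:

```python
import numpy as np

def rmse(y_pred, y_true):
    """Root mean square error between predictions and desired outputs."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))   # sqrt(4/3), about 1.1547
```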

Prediction of Mackey-Glass Time Series.
The Mackey-Glass chaotic time series is generated from the following delay differential equation:

$$\frac{dx(t)}{dt} = \frac{a\,x(t-\tau)}{1 + x^{10}(t-\tau)} - b\,x(t). \tag{21}$$

Here we predict $x(t+6)$ using the input variables $x(t)$, $x(t-6)$, $x(t-12)$, and $x(t-18)$. Parameters in (21) are set as $a = 0.2$, $b = 0.1$, $\tau = 17$, and $x(0) = 1.2$, which make the equation show chaotic behavior. One thousand input-output data points are extracted from the Mackey-Glass time series $x(t)$, where $t = 118$ to $t = 1117$. The first 500 data pairs of the series are used as training data, while the remaining 500 data pairs are used to validate the proposed network. After performing the clustering based initialization method proposed in Section 4.2, we get that the number of wavelons is $N = 9$.
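The data preparation described above can be sketched as follows. The paper does not state how the delay differential equation is integrated or what history is used before $t = 0$, so the forward Euler scheme with step 0.1 and the constant initial history are assumptions made here; the function names are likewise illustrative.

```python
import numpy as np

def mackey_glass(t_max, a=0.2, b=0.1, tau=17, x0=1.2, dt=0.1):
    """Integrate the Mackey-Glass equation by forward Euler (ASSUMED scheme).

    Returns x sampled at integer times t = 0, 1, ..., t_max.
    """
    steps_per_unit = int(round(1.0 / dt))
    delay = tau * steps_per_unit
    n_steps = t_max * steps_per_unit
    x = np.empty(n_steps + 1)
    x[0] = x0
    for k in range(n_steps):
        # ASSUMPTION: constant history x(t) = x0 for t <= 0
        x_tau = x[k - delay] if k >= delay else x0
        x[k + 1] = x[k] + dt * (a * x_tau / (1.0 + x_tau**10) - b * x[k])
    return x[::steps_per_unit]

def make_dataset(x, t_start=118, t_end=1117, lags=(18, 12, 6, 0), horizon=6):
    """Inputs x(t-18), x(t-12), x(t-6), x(t) and target x(t+6) for each t."""
    t = np.arange(t_start, t_end + 1)
    X = np.stack([x[t - lag] for lag in lags], axis=1)
    y = x[t + horizon]
    return X, y

x = mackey_glass(1123)                 # the last target needs x(1117 + 6)
X, y = make_dataset(x)
X_train, y_train = X[:500], y[:500]    # first 500 pairs for training
X_test, y_test = X[500:], y[500:]      # remaining 500 pairs for testing
print(X.shape, y.shape)
```

A different integrator or initial history changes the numerical values of the series, so results obtained with this sketch are not directly comparable to the figures reported in the tables.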
For the performance comparison of WNN-CIA with WNN-HIA, several different architectures are employed for WNN-HIA. Table 1 shows the mean and standard deviation (std.) of RMSE for training and testing data obtained when 100 runs were performed by each model. The models are trained for 500 epochs in each run. Some results of different models for the testing set are shown in Table 2. The RMSE reduction curve during training and testing of the gradient descent algorithm corresponding to the best WNN-CIA model is drawn in Figure 3. Figures 4 and 5 show the prediction output of the best WNN-CIA model and the corresponding prediction error for training and testing data, with the training and testing RMSE as 0.0080 and 0.0078.
From Table 1, it can be seen that the performance of WNN with structure and initial parameters derived by the proposed initialization approach is much better than that of WNN-HIA, even when more parameters are employed in the model.

Prediction of Box-Jenkins Time Series.
The gas furnace data of Box and Jenkins (1970), that is, the Box-Jenkins time series, were recorded from a combustion process of a methane-air mixture. The data set is well known and frequently used as a benchmark example for testing identification algorithms. During the process, the portion of methane was randomly changed, keeping a constant gas flow rate. The data set consists of 296 pairs of input-output measurements. The input $u(t)$ is the gas flow into the furnace and the output $y(t)$ is the CO$_2$ concentration in the outlet gas. The sampling interval is 9 s.
In this section, the data set used consists of 292 consecutive values of methane at time $(t-4)$ and CO$_2$ produced in a furnace at time $(t-1)$ as input variables, with the produced CO$_2$ at time $t$ as the output variable. Namely, variables $u(t-4)$ and $y(t-1)$ are used to predict $y(t)$. The data are partitioned into 200 data points as a training set and the remaining 92 points as a testing set for testing the performance of the proposed network. After performing the initialization method of WNN proposed in Section 4.2, we get the number of wavelons $N = 8$.
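The construction of the 292 input-output pairs from the 296 measurements can be sketched as follows. The random series below are placeholders standing in for the gas furnace measurements, not the real data, and the helper name is hypothetical.

```python
import numpy as np

def furnace_pairs(u, y):
    """Inputs (u(t-4), y(t-1)) and target y(t) for every t with a full history."""
    t = np.arange(4, len(y))
    X = np.column_stack([u[t - 4], y[t - 1]])
    return X, y[t]

# Placeholder series of length 296 (NOT the Box-Jenkins measurements)
rng = np.random.default_rng(0)
u = rng.normal(size=296)
y = rng.normal(size=296)

X, target = furnace_pairs(u, y)
X_train, t_train = X[:200], target[:200]    # 200 training points
X_test, t_test = X[200:], target[200:]      # remaining 92 testing points
print(X.shape, X_test.shape)                # (292, 2) (92, 2)
```

The first four samples are lost because $u(t-4)$ is undefined for them, which is why 296 measurements give 292 usable pairs.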
As is done in Section 6.1, different architectures are employed for WNN-HIA for comparison with WNN-CIA, whose structure and initial parameters are derived by the proposed approach. Table 3 shows the mean and standard deviation of RMSE for training and testing data obtained when 100 runs were performed by each model. The models are also trained for 500 epochs in each run. Table 4 shows some test results of different models. The RMSE reduction curve during training and testing of the gradient descent algorithm corresponding to the best WNN-CIA model is drawn in Figure 6. Figures 7 and 8 show the prediction output of the best WNN-CIA model and the corresponding prediction error for training and testing data, with the training and testing RMSE as 0.0186 and 0.0348.

Table 4 (rows recovered from the source): test results of different models.

| Method | Inputs | RMSE for testing set |
|---|---|---|
| Surmann's model [24] | 2 | 0.400 |
| Lee's model [25] | 2 | 0.638 |
| Lin's model [26] | 5 | 0.511 |
| Nie's model [27] | 4 | 0.412 |
| ANFIS model [28] | 2 | 0.085 |
| FuNN model [29] | 2 | 0.071 |

From the data in Table 3, we can see that WNN-CIA outperforms WNN-HIA when the same architectures are employed. When more parameters are employed in WNN-HIA, its performance gradually improves. However, the WNN-CIA model gives a more stable performance than all WNN-HIA models in the experiments.

In order to further examine the effectiveness of the proposed method, simulation experiments on a real-world example, traffic volume time series prediction, are carried out.

Prediction of the Traffic Volume Time Series (A Real-World Example).
Chen in [21] implemented neural network time series models for traffic volume forecasting. In this section, the data of hourly traffic volume for station 5 from [21], which were collected on IR 271 and IR 90 in Cuyahoga County, are used as the real-world time series to examine the performance of WNN-CIA as well as WNN-HIA. There are 105 volume data points collected from June 4, 4:00 pm, to June 8, 12:00 pm, for training purposes, with the remaining 9 data points collected from 1:00 am to 9:00 am on June 9 reserved for model accuracy checking. This is a one-step forecasting with the 6 preceding data points as the input vector. The raw time series is normalized so that its values lie in the interval [0, 1]. After performing the initialization method of WNN proposed in Section 4.2, we get the number of wavelons $N = 17$. Both identical and different architectures are employed for WNN-HIA for comparison with WNN-CIA. After 100 experiments with 500 epochs in each run, Table 5 shows the mean and standard deviation of RMSE for training and testing data for the two WNN models with different initialization methods. Test results of different models are shown in Table 6. The RMSE reduction curve during training and testing of the gradient descent algorithm corresponding to the best WNN-CIA model is drawn in Figure 9. Figures 10 and 11 show the prediction output of the best WNN-CIA model and the corresponding prediction error for training and testing data, with the training and testing RMSE as 0.0233 and 0.0335.
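The normalization and one-step windowing described above can be sketched as follows. The random integers are placeholders standing in for the 114 hourly volumes, not the data of [21]. Two details are assumptions made here because the text does not settle them: the minimum and maximum are taken over the whole series, and the last 9 windows are treated as the held-out test samples.

```python
import numpy as np

def min_max_normalize(series):
    """Map a series linearly onto [0, 1]; also return the (min, max) used."""
    s = np.asarray(series, dtype=float)
    lo, hi = s.min(), s.max()
    return (s - lo) / (hi - lo), (lo, hi)

def one_step_windows(series, n_lags=6):
    """Each row holds n_lags consecutive values; the target is the next value."""
    s = np.asarray(series, dtype=float)
    X = np.stack([s[i:i + n_lags] for i in range(len(s) - n_lags)])
    y = s[n_lags:]
    return X, y

# Placeholder series of 114 hourly volumes (NOT the traffic data of [21])
rng = np.random.default_rng(1)
volume = rng.integers(200, 5000, size=114)

scaled, (lo, hi) = min_max_normalize(volume)
X, y = one_step_windows(scaled)
# ASSUMPTION: the last 9 targets correspond to the 9 held-out hours
X_train, y_train = X[:-9], y[:-9]
X_test, y_test = X[-9:], y[-9:]
print(X.shape, X_train.shape, X_test.shape)
```

Returning `(lo, hi)` makes it possible to map predictions back to raw traffic volumes after forecasting.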

From Table 5, we can see that the performance of WNN with the proposed clustering based initialization procedure is also superior to that with the heuristic initialization approach, even when more parameters are employed in WNN-HIA. This demonstrates again the validity of our method.

Conclusion
In this paper, a novel initialization procedure for WNN as a time series predictor is proposed, which behaves as a dimensional clustering process. Taking account of the distribution of input patterns and the local response property of wavelet functions, the proposed approach classifies the input patterns dynamically. The architecture as well as the initial values of the translation and dilation parameters of the WNN model can then be determined accordingly. Simulation results demonstrate that, besides being able to determine the number of neurons, WNN with our initialization method provides satisfactory and stable results for time series prediction.