A hybrid model for time series forecasting is proposed. It is a stacked neural network composed of two multilayer perceptrons: a normal one with bipolar sigmoid activation functions, and a second one with an exponential activation function in the output layer. As the case studies show, the proposed stacked hybrid neural model performs well on a variety of benchmark time series. The combination of weights of the two stack components that leads to optimal performance is also studied.

Many real-world processes are nonlinear, so accurate and effective tools are needed to forecast their behavior. Current solutions include general methods such as multiple linear regression, nonlinear regression, and artificial neural networks, as well as specialized ones, such as

Recently, a number of hybrid forecasting models have been developed that integrate neural network techniques with conventional models to improve accuracy. A well-known example is

Other techniques include a combination of radial basis functions,

The study of nonlinear phenomena is especially important in the field of system dynamics. Research in this area has shown that nonlinear differential equations can generate continuous mathematical functions similar to pulse sequences [

Unlike traditional model-based methods, NNs are data-driven and self-adaptive, needing little a priori information about the studied problems.

NNs are known to have good generalization capabilities. After learning the data presented to them, they can correctly predict unseen data, even in the presence of noise.

NNs are universal approximators [

NNs are nonlinear models, and therefore better suited to capture the true nature of many natural processes.

The proposed model is composed of two neural networks, each with one hidden layer, as shown in Figure

The architecture of the stacked neural network.

The inputs of the model are the most recent values of the time series; their number is given by the size of the sliding window

The training of the networks is performed with the classical back-propagation algorithm [
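As a sketch of this architecture, the two component networks and their weighted combination can be written as follows; the layer sizes, initialization, and helper names here are illustrative assumptions, not the authors' exact settings, and training (omitted) would proceed by standard back-propagation:

```python
import numpy as np

rng = np.random.default_rng(0)

def bipolar_sigmoid(x):
    # Bipolar sigmoid: tanh-shaped, with outputs in (-1, 1)
    return 2.0 / (1.0 + np.exp(-x)) - 1.0

class OneHiddenLayerMLP:
    """A perceptron with one hidden layer; `out_act` selects the output activation."""
    def __init__(self, n_in, n_hidden, out_act):
        self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_in))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, n_hidden)
        self.b2 = 0.0
        self.out_act = out_act

    def forward(self, x):
        h = bipolar_sigmoid(self.W1 @ x + self.b1)
        return float(self.out_act(self.W2 @ h + self.b2))

# The two stack components: a "normal" network with a bipolar sigmoid output,
# and an "exponential" network with exp(.) as the output-layer activation.
normal_nn = OneHiddenLayerMLP(n_in=5, n_hidden=8, out_act=bipolar_sigmoid)
exp_nn = OneHiddenLayerMLP(n_in=5, n_hidden=8, out_act=np.exp)

def stack_predict(x, w_normal):
    # The stack output is a weighted combination of the two component outputs.
    return w_normal * normal_nn.forward(x) + (1.0 - w_normal) * exp_nn.forward(x)
```

Note that the exponential output unit is what lets the second component extrapolate exponential trends that a bounded sigmoid output cannot reach.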

In the following sections, we consider four classical benchmark problems and one original, super-exponential growth problem on which we test the performance of our model. In each case, we divide the available data into 90% for training and 10% for testing. We separately consider sliding window sizes of 5 and 10.
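The windowing and chronological split described above can be sketched as follows (the synthetic series is only a stand-in for the benchmark data):

```python
import numpy as np

def make_windows(series, window):
    # Each input row holds `window` consecutive values; the target is the next value.
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = np.array(series[window:])
    return X, y

# A synthetic series of 288 points, matching the length of the sunspot data.
series = np.sin(np.linspace(0.0, 20.0, 288))
X, y = make_windows(series, window=5)

# Chronological 90% / 10% split -- no shuffling, since order matters in time series.
cut = int(0.9 * len(X))
X_train, y_train = X[:cut], y[:cut]
X_test, y_test = X[cut:], y[cut:]
```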

Wolfer’s sunspot time series records the yearly number of spots visible on the surface of the sun. It contains data from 1700 to 1987, for a total of 288 observations. This series is considered nonlinear and non-Gaussian and is often used to evaluate the effectiveness of nonlinear models [

With a window size of 5 points and considering 28 points ahead, the performance of the model on the training set is displayed in Figure

The proposed model performance on the sunspot training data (window size 5).

The forecasting capabilities of the model are displayed in Figure

The proposed model predictions for the sunspot data (window size 5).

In Figure

The evolution of MSE when

Table

The errors of the model for the sunspot data (window size 5).

| | | MSE on original data | MSE on normalized data |
|---|---|---|---|
| Training | Normal NN | 131.865 | |
| | Exponential NN | 827.933 | |
| | Stack | 131.865 | |
| Testing | Normal NN | 511.531 | |
| | Exponential NN | 2248.409 | |
| | Stack | 511.531 | |

Next, we increase the size of the sliding window to 10 points. The corresponding performance of the model on the training set is displayed in Figure

The proposed model performance on the sunspot training data (window size 10).

The prediction capabilities of the model are displayed in Figure

The proposed model predictions for the sunspot data (window size 10).

The evolution of the mean square error of the stack on the testing data is displayed as a function of the normal NN weight,

The evolution of MSE when

Table

The errors of the model for the sunspot data (window size 10).

| | | MSE on original data | MSE on normalized data |
|---|---|---|---|
| Training | Normal NN | 96.085 | |
| | Exponential NN | 811.737 | |
| | Stack | 96.085 | |
| Testing | Normal NN | 619.387 | |
| | Exponential NN | 2466.148 | |
| | Stack | 619.387 | |

In both cases, the normal neural network approximates the time series better than the exponential network. With a window size of 10, the model appears to slightly overfit the data compared to a window size of 5, yielding lower errors on the training set but somewhat higher errors on the testing set.
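To illustrate how the stack weight is chosen, here is a minimal sketch that sweeps the normal-NN weight over [0, 1] and picks the value minimizing the stack MSE; the toy prediction vectors stand in for the outputs of the two trained networks and are assumptions for illustration only:

```python
import numpy as np

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

def best_stack_weight(pred_normal, pred_exp, target, steps=101):
    # Sweep the normal-NN weight w over [0, 1]; the exponential NN receives 1 - w.
    ws = np.linspace(0.0, 1.0, steps)
    errors = [mse(w * pred_normal + (1.0 - w) * pred_exp, target) for w in ws]
    i = int(np.argmin(errors))
    return float(ws[i]), errors[i]

# Toy predictions: the normal NN is nearly right, the exponential NN overshoots,
# so the sweep should place essentially all of the weight on the normal NN.
target = np.array([1.0, 2.0, 3.0, 4.0])
pred_normal = target + 0.1
pred_exp = 1.5 * target
w_opt, err_opt = best_stack_weight(pred_normal, pred_exp, target)
```

This mirrors the behavior reported for the sunspot data, where the normal network dominates the stack.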

This series contains the yearly number of lynx trapped in the Mackenzie River district of Northern Canada [

With a window size of 5 points and considering 11 points ahead, the performance of the model on the training set is displayed in Figure

The proposed model performance on the lynx training data (window size 5).

The forecasting capabilities of the model are displayed in Figure

The proposed model predictions for the lynx data (window size 5).

The evolution of the mean square error of the stack on the testing data is displayed as a function of the normal NN weight,

The evolution of MSE when

Table

The errors of the model for the lynx data (window size 5).

| | | MSE on original data | MSE on normalized data |
|---|---|---|---|
| Training | Normal NN | 530,416.473 | |
| | Exponential NN | 383,963.211 | |
| | Stack | 401,015.900 | |
| Testing | Normal NN | 154,951.311 | |
| | Exponential NN | 113,955.783 | |
| | Stack | 109,902.034 | |

When the size of the window is increased to 10 points, the performance of the model on the training set is shown in Figure

The proposed model performance on the lynx training data (window size 10).

The testing performance of the model is displayed in Figure

The proposed model predictions for the lynx data (window size 10).

The optimal weights for this stack are

The evolution of MSE when

Table

The errors of the model for the lynx data (window size 10).

| | | MSE on original data | MSE on normalized data |
|---|---|---|---|
| Training | Normal NN | 66,484.341 | |
| | Exponential NN | 88,404.178 | |
| | Stack | 66,426.519 | |
| Testing | Normal NN | 283,132.166 | |
| | Exponential NN | 421,951.130 | |
| | Stack | 283,105.757 | |

For this dataset, the exponential network contributes to the stack result. When the window size is 5, its weight even dominates the stack; its contribution decreases for a larger window size. A possible explanation is that with a smaller window, the series may look exponential when the model learns the high peaks, whereas with a larger window the model has a wider perspective that includes the peaks, and the problem may appear more linear. The errors for a window size of 10 are also much smaller on the training set, and larger on the testing set, than those for a window size of 5.

This data series contains monthly ozone concentrations, in parts per million, recorded in downtown Los Angeles from January 1955 to December 1972 [

We consider a window size of 5 points and 24 points ahead for prediction. The performance of the model on the training set is displayed in Figure

The proposed model performance on the ozone training data (window size 5).

The prediction performance of the model is shown in Figure

The proposed model predictions for the ozone data (window size 5).

The evolution of the mean square error of the stack on the testing data is displayed as a function of the normal NN weight,

The evolution of MSE when

Table

The errors of the model for the ozone data (window size 5).

| | | MSE on original data | MSE on normalized data |
|---|---|---|---|
| Training | Normal NN | 0.704 | |
| | Exponential NN | 0.680 | |
| | Stack | 0.675 | |
| Testing | Normal NN | 0.702 | |
| | Exponential NN | 0.592 | |
| | Stack | 0.589 | |

When the size of the window is increased to 10 points, the performance of the model on the training set is shown in Figure

The proposed model performance on the ozone training data (window size 10).

The forecasting capabilities of the model are displayed in Figure

The proposed model predictions for the ozone data (window size 10).

The optimal weights are

The evolution of MSE when

Table

The errors of the model for the ozone data (window size 10).

| | | MSE on original data | MSE on normalized data |
|---|---|---|---|
| Training | Normal NN | 0.203 | |
| | Exponential NN | 0.222 | |
| | Stack | 0.201 | |
| Testing | Normal NN | 1.328 | |
| | Exponential NN | 1.238 | |
| | Stack | 1.312 | |

The behavior of the model on this time series is very similar to that on the lynx time series, in terms of both the change in the weights and the relationship between training and testing errors.

This data series contains the index of industrial production in the United Kingdom, from 1700 to 1912 [

We first consider the performance of the model on the training set with a window size of 5 points and 21 points ahead, as displayed in Figure

The proposed model performance on the UK industrial production training data (window size 5).

The forecasting capabilities of the model are shown in Figure

The proposed model predictions for the UK industrial production data (window size 5).

The evolution of the mean square error of the stack on the testing data is displayed as a function of the normal NN weight,

The evolution of MSE when

Table

The errors of the model for the UK industrial production data (window size 5).

| | | MSE on original data | MSE on normalized data |
|---|---|---|---|
| Training | Normal NN | 1.185 | |
| | Exponential NN | 0.988 | |
| | Stack | 0.988 | |
| Testing | Normal NN | 296.895 | |
| | Exponential NN | 9.810 | |
| | Stack | 9.810 | |

When the size of the window is increased to 10 points, the performance of the model on the training set is shown in Figure

The proposed model performance on the UK industrial production training data (window size 10).

The prediction capabilities of the model are displayed in Figure

The proposed model predictions for the UK industrial production data (window size 10).

As in the previous case, the optimal weights are

The evolution of MSE when

Table

The errors of the model for the UK industrial production data (window size 10).

| | | MSE on original data | MSE on normalized data |
|---|---|---|---|
| Training | Normal NN | 0.861 | |
| | Exponential NN | 0.862 | |
| | Stack | 0.862 | |
| Testing | Normal NN | 319.766 | |
| | Exponential NN | 7.264 | |
| | Stack | 7.264 | |

Unlike the previous problems, the exponential nature of this time series makes it difficult to model with a normal neural network. The exponential network therefore dominates the stack, regardless of the window size. Notably, although the normal network approximates the training set fairly well, with errors comparable to those of the exponential network, there is a clear difference in performance in the prediction phase, where only the exponential network finds a good trend for the time series.

In order to test the limits of our model, we devised a function given by the following equation:

Since the function is super-exponential while the activation function of the second neural network is only exponential, we expect our stack model to learn the training data well but fail to extrapolate to the prediction set. This drawback can be compensated for by allowing different activation functions, such as a double-exponential function

With a window size of 5 points and considering 21 points ahead, the performance of the model on the training set is displayed in Figure

The proposed model performance on the super-exponential growth training data (window size 5).

The forecasting capabilities of the model are displayed in Figure

The proposed model predictions for the super-exponential growth data (window size 5).

The evolution of the mean square error of the stack on the testing data is displayed as a function of the normal NN weight,

The evolution of MSE when

Table

The errors of the model for the super-exponential growth data (window size 5).

| | | MSE on original data | MSE on normalized data |
|---|---|---|---|
| Training | Normal NN | 2.825 | |
| | Exponential NN | 1.593 | |
| | Stack | 1.593 | |
| Testing | Normal NN | 826.936 | |
| | Exponential NN | 172.645 | |
| | Stack | 172.645 | |

When the size of the window is increased to 10 points, the performance of the model on the training set is shown in Figure

The proposed model performance on the super-exponential growth training data (window size 10).

The forecasting capabilities of the model are displayed in Figure

The proposed model predictions for the super-exponential growth data (window size 10).

As in the case with a window size of 5, the optimal weights are

The evolution of MSE when

Table

The errors of the model for the super-exponential growth data (window size 10).

| | | MSE on original data | MSE on normalized data |
|---|---|---|---|
| Training | Normal NN | 5.270 | |
| | Exponential NN | 1.047 | |
| | Stack | 1.047 | |
| Testing | Normal NN | 939.119 | |
| | Exponential NN | 76.707 | |
| | Stack | 76.707 | |

This time series poses problems similar to the previous one. The difference is that the super-exponential nature of the proposed function exceeds the prediction capabilities of the exponential network. For this kind of problem, other types of activation functions can be used. The stacked model proposed here is flexible enough to accommodate different types of neural networks, with different activation functions.
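As an illustration of such a replacement, exp(exp(x)) is one plausible double-exponential activation; the exact function the text alludes to is not specified here, so this particular form is only an assumption:

```python
import numpy as np

def double_exponential(x):
    # Hypothetical double-exponential activation, exp(exp(x)).
    # Clipping the inner argument keeps the outer exponential from overflowing.
    return np.exp(np.exp(np.clip(x, -50.0, 4.0)))

x = np.array([0.0, 1.0, 2.0])
fast = double_exponential(x)   # grows like exp(exp(x))
slow = np.exp(x)               # grows only like exp(x)
```

An output unit of this kind could, in principle, track super-exponential targets that a plain exponential output cannot reach.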

We compare the model-fitting performance of our stacked neural network with that of several other models, using the implementations in the Statistical Analysis System (SAS) 9.0 software package [

Comparative performance of the proposed model with other forecasting models.

| Time series | MSE of the stacked neural network (window size 5) | MSE of the stacked neural network (window size 10) | MSE of the best SAS model | Name of the best SAS model |
|---|---|---|---|---|
| Sunspots | **131.865** | | 549.21 | Simple exponential smoothing |
| Lynx | **401,015.900** | | 1,410,768 | Simple exponential smoothing |
| Ozone | **0.704** | | 1.079 | Simple exponential smoothing |
| UK Industrial Production | **0.988** | | 1.339 | Log random walk with drift |
| Super-Exponential Growth | 1.593 | 1.047 | | Double exponential smoothing |

The best error for a problem is shown in bold letters. It can be seen that our model outperforms the models implemented in SAS on all but the last benchmark problem. The larger error on the super-exponential growth problem stems from the inherent limitation of choosing an exponential rather than a super-exponential activation function.

Despite its simplicity, the stacked hybrid neural model performs well on a variety of benchmark time series problems. It can be expected to give good results for other important problems with dynamical and predictive aspects. The model can easily be extended to incorporate other activation functions suited to a particular problem, such as a double-exponential function

This work was supported in part by CNCSIS grant code 316/2008,