ℓp-Norm Multikernel Learning Approach for Stock Market Price Forecasting

Linear multiple kernel learning models have been used for predicting financial time series. However, ℓ1-norm multiple kernel support vector regression is rarely observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures that generalize well, we adopt ℓp-norm multiple kernel support vector regression (1 ≤ p < ∞) as a stock price prediction model. The optimization problem is decomposed into smaller subproblems, and an interleaved optimization strategy is employed to solve the regression model. The model is evaluated on forecasting the daily closing prices of the Shanghai Stock Index in China. Experimental results show that the proposed model performs better than the ℓ1-norm multiple kernel support vector regression model.


Introduction
Forecasting the future values of financial time series is an appealing yet difficult activity in the modern business world. As explained by Deboeck and Yaser [1,2], financial time series are inherently noisy, nonstationary, and deterministically chaotic. In the past, many methods were proposed for tackling this kind of problem. For instance, the linear models for forecasting the future values of stock prices include the autoregressive (AR) model [3], the autoregressive moving average (ARMA) model [4], and the autoregressive integrated moving average (ARIMA) model [4]. Over the last decade, nonlinear approaches have received increasing attention in financial time series prediction and have been proposed as a more satisfactory answer to the problem. For example, Yao and Tan [5] used time series data and technical indicators as the input of neural networks to increase the forecast accuracy of exchange rates; Cao and Tay [6,7] applied the support vector machine (SVM) to financial forecasting and compared it with the multilayer back-propagation (BP) neural network and the regularized radial basis function (RBF) neural network; Qi and Wu [8] proposed a multilayer feed-forward network to forecast exchange rates; Pai and Lin [9] investigated a hybrid ARIMA and support vector machine model for stock price forecasting; Pai et al. [10] presented a hybrid SVM model to exploit the unique strengths of the linear and nonlinear SVM models in forecasting exchange rates; Kwon and Moon [11] proposed a hybrid neurogenetic system for stock trading; Hung and Hong [12] presented an improved ant colony optimization algorithm in a support vector regression (SVR) model, called SVRCACO, for selecting suitable parameters in exchange rate forecasting; Jiang and He [13] introduced local grey SVR (LG-SVR), which integrates the grey relational grade with local SVR, for financial time series forecasting; and so on.
In comparison with the previous models, SVR with a single kernel function can exhibit better prediction accuracy because it embodies the structural risk minimization principle, which considers both the training error and the capacity of the regression model [14,15]. However, researchers have to determine in advance the type of kernel function and the associated kernel hyperparameters for SVR. Unsuitably chosen kernel functions or hyperparameter settings may lead to significantly poorer performance [16,17].
In recent years there has been a lot of interest in designing principled regression algorithms over multiple cues, based on the intuitive notion that using more features should lead to better performance and a lower generalization error. When the right choice of features is unknown, learning linear combinations of multiple kernels is an appealing strategy. This approach, in which the kernel combination is obtained through an optimization process, is called multiple kernel learning (MKL). A first step towards a more realistic model of MKL was achieved by Lanckriet et al. [18], who showed that, given a candidate set of kernels, it is computationally feasible to learn a support vector machine and a linear kernel combination simultaneously. In MKL we need to solve a joint optimization problem while also learning the optimal weights for combining the kernels. Several practitioners have adopted linear multiple kernels to deal with practical problems. For example, Rakotomamonjy et al. [19] addressed the MKL problem through a weighted ℓ2-norm regularization formulation and proposed an algorithm, named SimpleMKL, for solving it. Bach [20] studied the asymptotic model consistency of the group Lasso. Zhang and Shen [21] presented a multimodal multitask learning algorithm for joint prediction of multiple regression and classification variables in Alzheimer's disease. In particular, Chi-Yuan Yeh and his coworkers [22] developed a two-stage MKL algorithm by incorporating sequential minimal optimization and the gradient projection method. The new method [22] performed better than previous ones for forecasting financial time series. Previous approaches to MKL have promoted sparse kernel combinations to support interpretability and scalability.
Unfortunately, sparsity at the kernel level may harm the generalization performance of the learner; as a result, ℓ1-norm MKL is rarely observed to outperform trivial baselines in practical applications [23]. To allow for robust kernel mixtures that generalize well, researchers have extended ℓ1-norm MKL to arbitrary norms, that is, ℓp-norm MKL (1 ≤ p < ∞). For example, Marius Kloft et al. developed two efficient interleaved strategies for ℓp-norm MKL and showed that it can achieve better accuracy than ℓ1-norm MKL on real-world problems [23]; Francesco Orabona et al. presented an MKL optimization algorithm based on stochastic gradient descent for ℓp-norm MKL, which converges faster as the number of kernels grows [24].
In this paper, a multiple kernel learning framework is established for learning and predicting stock prices. We present a regression model for the future values of stock prices, namely, ℓp-norm multiple kernel support vector regression (ℓp-norm MK-SVR), where 1 ≤ p < ∞. We decompose the optimization problem into smaller subproblems and adopt an interleaved optimization strategy to solve the regression model. Our experimental results show that ℓp-norm MK-SVR achieves better forecasting performance than the compared methods.
The rest of this paper is arranged as follows. Section 2 details the construction of the ℓp-norm MK-SVR model and describes the algorithm for our regression model. Experimental results are presented in Section 3. Section 4 concludes the paper and provides some future research directions.

ℓp-Norm Multiple Kernel Support Vector Regression.
In this section, the idea of ℓp-norm multiple kernel support vector regression (ℓp-norm MK-SVR) is introduced formally.
Let $\{(x_i, y_i)\}$, where $x_i \in \mathbb{R}^n$ and $y_i \in \mathbb{R}$, be the training set. Each $y_i$ is the desired output value for the input vector $x_i$. Consider a function $\varphi(x_i) : \mathbb{R}^n \to H$ that maps the samples into a high, possibly infinite, dimensional space. A regression model is learned from the training set and used to predict the target values of unseen input vectors. SVR is a nonlinear kernel-based regression method which tries to locate a regression hyperplane with small risk in a high-dimensional feature space [14]. Considering the soft margin formulation, the following objective function and constraints are to be solved for SVR:

\[
\min_{w,\,b,\,\xi,\,\xi^{*}} \ \frac{1}{2}\|w\|^{2} + C\sum_{i}(\xi_i + \xi_i^{*}) \quad \text{s.t.} \quad
y_i - \langle w, \varphi(x_i)\rangle - b \le \varepsilon + \xi_i,\quad
\langle w, \varphi(x_i)\rangle + b - y_i \le \varepsilon + \xi_i^{*},\quad
\xi_i,\ \xi_i^{*} \ge 0. \tag{1}
\]

The SVR model usually uses a single mapping function $\varphi$ and hence a single kernel function $K$. Although the SVR model has good function approximation and generalization capabilities, it is not well suited to a data set which has a locally varying distribution. To resolve this problem, we can construct an MK-SVR model. By combining multiple kernels instead of using a single one, the ℓp-norm MK-SVR model can capture the varying distribution very well. Therefore we can use the composite feature map $\varphi$, which has a block structure,

\[
\varphi(x) = \left(\sqrt{d_1}\,\varphi_1(x),\ \sqrt{d_2}\,\varphi_2(x),\ \ldots,\ \sqrt{d_M}\,\varphi_M(x)\right)^{\top},
\]

to map the input space to the feature space, where $d_1, d_2, \ldots, d_M$ are the weights of the component functions. Given a set of base kernels $K_k$ which correspond to the feature maps $\{\varphi_k\}$ ($k = 1, 2, \ldots, M$), linear MK-SVR aims to learn a linear combination of the base kernels,

\[
K = \sum_{k=1}^{M} d_k K_k .
\]

In learning with MK-SVR we aim at minimizing the loss on the training data with respect to the optimal kernel mixture $\sum_k d_k K_k$, in addition to regularizing $d$ to avoid overfitting. The primal is therefore formulated as a joint minimization of the training loss and a regularizer on $d$.

Previous research on MK-SVR employs a regularizer of the form $\|d\|_1$, which promotes sparse kernel mixtures. However, sparsity is not always desirable, since the information carried in the zero-weighted kernels is lost.
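The linear combination of base kernels described above can be illustrated with a minimal sketch. It assumes the base kernel matrices are already computed and stored as nested Python lists, and that the weights are nonnegative; the helper names `combine_kernels` and `lp_norm` are hypothetical and are not taken from the paper.

```python
def combine_kernels(kernels, d):
    """Return K = sum_k d_k * K_k for square kernel matrices of equal size."""
    if len(kernels) != len(d):
        raise ValueError("need exactly one weight per base kernel")
    if any(w < 0 for w in d):
        # Assumption of this sketch: kernel weights are nonnegative.
        raise ValueError("kernel weights are assumed to be nonnegative")
    n = len(kernels[0])
    combined = [[0.0] * n for _ in range(n)]
    for weight, K_k in zip(d, kernels):
        for i in range(n):
            for j in range(n):
                combined[i][j] += weight * K_k[i][j]
    return combined


def lp_norm(d, p):
    """lp-norm of the weight vector d, for 1 <= p < infinity."""
    if p < 1:
        raise ValueError("p must satisfy p >= 1")
    return sum(abs(w) ** p for w in d) ** (1.0 / p)


# Toy example with two 2x2 base kernel matrices (illustrative values only).
K1 = [[1.0, 0.5], [0.5, 1.0]]
K2 = [[1.0, 0.1], [0.1, 1.0]]
sparse = combine_kernels([K1, K2], [1.0, 0.0])       # second kernel is discarded
nonsparse = combine_kernels([K1, K2], [0.25, 0.75])  # both kernels contribute
```

With the sparse weight vector the second kernel has no influence on the combined matrix, which is the information loss the text refers to.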
Therefore we propose to use nonsparse and thus more robust kernel mixtures by employing an ℓp-norm constraint on $d$ with $p > 1$, which gives problem (3). In (3), let the weight vectors $w_k$ be rescaled by the kernel weights $d_k$, let $C = 1/(n\lambda)$, let $\mu$ be rescaled by $\lambda$, and divide the first equation by $\lambda$; the ℓp-norm MK-SVR model (4) is then obtained.

An alternative to the previous formulations has been considered by other researchers. For example, Zien and Ong [25] upper-bound the value of the regularizer $\|d\|_1$ and incorporate it as an additional constraint into the optimization problem. Following this idea, the ℓp-norm MK-SVR model (4) can be transformed into the constrained form (5). It can be shown (see the Appendix for details) that the dual of (5) is problem (6). Once the optimal dual variables are found by solving (6), the regression hyperplane for the ℓp-norm MK-SVR model follows from them. In the following section, an efficient algorithm is proposed for solving the optimization problem (6).

The ℓp-norm MK-SVR model (6) can be trained with several algorithms, for example, the sequential minimal optimization algorithm [26] and multikernel learning with online-batch optimization [24]. In this paper, interleaved optimization is used as the optimization scheme, following the idea of [23]. We can exploit the structure of the ℓp-norm MK-SVR cost function by alternating between optimizing the linear combination of the base kernels, $K = \sum_k d_k K_k$, and the remaining variables $\alpha$ and $\alpha^{*}$. We do so by setting up a two-stage optimization algorithm. The basic idea is to divide the optimization variables of the ℓp-norm MK-SVR problem (6) into two groups, $(\alpha, \alpha^{*})$ on one hand and $d = (d_1, d_2, \ldots, d_M)$ on the other. The procedure alternates between these two stages via a block coordinate descent algorithm: the optimization over $d$ is carried out analytically, and $(\alpha, \alpha^{*})$ are computed in the dual. The two stages are iterated until the specified stopping criterion is met, as shown in Figure 1.
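For orientation, the regression function obtained from a dual solution of a multiple kernel SVR is commonly written as a kernel expansion. The expression below is a sketch under the standard SVR conventions, not a reproduction of the paper's own formula; the symbols $\alpha_i^{*}$ (the second set of dual variables), $b$ (the offset), and $K_k(x_i, x)$ (the $k$-th base kernel evaluated on a training point and a new input) are notational assumptions.

```latex
f(x) = \sum_{i} \left(\alpha_i - \alpha_i^{*}\right) \sum_{k=1}^{M} d_k \, K_k(x_i, x) + b
```

In this form the kernel weights $d_k$ enter the prediction only through the combined kernel, which is why the two groups of variables can be optimized in alternation.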

An Optimization Algorithm.
In the first stage, the variables $(\alpha, \alpha^{*})$ are kept fixed, that is, they are treated as known. The optimal $d$ in the ℓp-norm MK-SVR model (6) can then be calculated analytically.
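The analytical weight step can be sketched as follows. The closed form used here is the one known from the ℓp-norm MKL literature that the paper follows [23]; it is an assumption that the paper's own update has exactly this shape. The names `quad_form`, `update_kernel_weights`, and `beta` (standing for the difference of the two dual vectors) are hypothetical.

```python
import math


def quad_form(K, v):
    """Compute v^T K v for a square matrix K given as nested lists."""
    n = len(v)
    return sum(v[i] * K[i][j] * v[j] for i in range(n) for j in range(n))


def update_kernel_weights(kernels, beta, d, p):
    """Closed-form weight step for fixed dual variables (assumed form, after [23]).

    beta[i] stands for alpha_i - alpha_i^*. The block norms are taken as
    ||w_k||^2 = d_k^2 * beta^T K_k beta, and the new weights are
    d_k = ||w_k||^(2/(p+1)) / (sum_j ||w_j||^(2p/(p+1)))^(1/p),
    which have unit lp-norm by construction.
    """
    if p < 1:
        raise ValueError("p must satisfy p >= 1")
    norms = [math.sqrt(max(d_k ** 2 * quad_form(K_k, beta), 0.0))
             for d_k, K_k in zip(d, kernels)]
    total = sum(norm ** (2.0 * p / (p + 1.0)) for norm in norms)
    if total == 0.0:
        return list(d)  # degenerate case: keep the current weights
    scale = total ** (1.0 / p)
    return [norm ** (2.0 / (p + 1.0)) / scale for norm in norms]


# Toy example: two diagonal base kernels and fixed dual differences.
kernels = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]]
new_d = update_kernel_weights(kernels, beta=[1.0, -1.0], d=[1.0, 1.0], p=2.0)
```

Because the update is a closed-form expression, this stage costs only one quadratic form per base kernel.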
In the second stage, a chunking-based training algorithm (Algorithm 1) with an analytical update of the kernel weights is used for ℓp-norm MK-SVR. The kernel weighting $d$ and $(\alpha, \alpha^{*})$ are optimized in an interleaved way. The basic idea of this algorithm is to divide the optimization problem into an inner subproblem and an outer subproblem. The algorithm alternates between solving the two subproblems until convergence.

Algorithm 1
In every iteration, the inner subproblem (the $\alpha$ and $\alpha^{*}$ step) identifies the constraint that maximises (6) with the kernel weighting $d$ fixed. The outer subproblem (the $d$ step) is also called the restricted master problem. Each $d_k$, $k = 1, 2, \ldots, M$, is computed with (10).
The interleaved optimization algorithm is depicted in Algorithm 1, and the details of it are as follows.

Chunking and Carrying out with SVR.
In each iteration, the procedure is standard in chunking-based SVR solvers and is carried out by SVMlight, where the working set Q is chosen as described in [28]. We implement the greedy second-order working set selection strategy of [28]. Rather than computing the gradient repeatedly, we speed up variable selection by caching, separately for each kernel. The cache needs to be updated every time $\alpha_Q$ and $\alpha^{*}_Q$ change in the reduced variable optimisation. In Algorithm 1, (4) and (5) compute the objective values of SVR. Finally, the analytical value of $d$ is computed in (10).
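The alternation between the two stages can be outlined in a few lines. This is a simplified sketch, not Algorithm 1 itself: the SVR step is abstracted as a user-supplied callback `solve_dual` (hypothetical), chunking and caching are omitted, the weight step uses the closed form assumed from [23], and the stopping rule (largest weight change below `tol`) is an assumption because the text does not specify the criterion.

```python
import math


def quad_form(K, v):
    """Compute v^T K v for a square matrix K given as nested lists."""
    n = len(v)
    return sum(v[i] * K[i][j] * v[j] for i in range(n) for j in range(n))


def combine_kernels(kernels, d):
    """Return the weighted sum of the base kernel matrices."""
    n = len(kernels[0])
    return [[sum(w * K[i][j] for w, K in zip(d, kernels)) for j in range(n)]
            for i in range(n)]


def weight_step(kernels, beta, d, p):
    """Analytical lp-norm update of the kernel weights (assumed form, after [23])."""
    norms = [math.sqrt(max(d_k ** 2 * quad_form(K_k, beta), 0.0))
             for d_k, K_k in zip(d, kernels)]
    total = sum(norm ** (2.0 * p / (p + 1.0)) for norm in norms)
    if total == 0.0:
        return list(d)
    scale = total ** (1.0 / p)
    return [norm ** (2.0 / (p + 1.0)) / scale for norm in norms]


def interleaved_mkl(kernels, solve_dual, p=2.0, max_iter=100, tol=1e-6):
    """Alternate between the dual step (weights fixed) and the weight step."""
    M = len(kernels)
    d = [(1.0 / M) ** (1.0 / p)] * M  # uniform start with unit lp-norm
    beta = None
    for _ in range(max_iter):
        beta = solve_dual(combine_kernels(kernels, d))  # alpha - alpha^* for fixed d
        d_new = weight_step(kernels, beta, d, p)
        change = max(abs(a - b) for a, b in zip(d, d_new))
        d = d_new
        if change < tol:  # assumed stopping criterion
            break
    return d, beta


def toy_solver(K):
    """Stand-in for an SVR solver, for illustration only: ignores K."""
    return [1.0, -1.0]


kernels = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]]
weights, beta = interleaved_mkl(kernels, toy_solver, p=2.0)
```

In practice `solve_dual` would wrap an SVR solver that accepts a precomputed kernel matrix; the toy solver above only exercises the control flow.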

Experimental Results
In this section, two experiments on a real financial time series are carried out to assess the performance of ℓp-norm MK-SVR. The motivation behind the two experiments is to compare the performance of our proposed method with that of other methods, namely, single kernel support vector regression (SKSVR) [29] and ℓ1-norm MK-SVR [22]. All calculations are performed with programs developed in MATLAB R2010a.

Experiment I.
The data sets are listed in Table 1. According to [29], we can derive training patterns $(x_t, y_t)$ from the original daily stock closing prices $P = \{p_1, \ldots, p_t, \ldots\}$ for SKSVR and ℓp-norm MK-SVR. Let

\[
\mathrm{EMA}_n(t) = \mathrm{EMA}_n(t-1) + \alpha \times \left(p_t - \mathrm{EMA}_n(t-1)\right)
\]

be the n-day exponential moving average on the t-th day, where $p_t$ is the closing price on the t-th day and $\alpha = 2/(n+1)$; the output variable $y_t$ is then defined from these smoothed prices. Let $x_t = (x_{t,1}, x_{t,2}, x_{t,3}, x_{t,4}, x_{t,5})$ be the input vector and let $\mathrm{RDP}_{-n}(t) = 100 \times (p_t - p_{t-n})/p_{t-n}$ be the lagged relative difference in percentage of price (RDP). Moreover, we can obtain a transformed closing price $\mathrm{EWA}_n(t)$ by subtracting an n-day EMA from the closing price, that is, $\mathrm{EWA}_n(t) = p_t - \mathrm{EMA}_n(t)$. Based on the above, the input variables can be defined as $x_{t,1} = \mathrm{EWA}_{15}$, and $x_{t,3} = \mathrm{RDP}_{-20}(t-5)$. We adopt the root mean squared error (RMSE) for performance comparison, that is,

\[
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{t=1}^{N}\left(y_t - \hat{y}_t\right)^{2}}, \tag{13}
\]

where $y_t$ and $\hat{y}_t$ are the desired output and the predicted output, respectively, and $N$ is the number of patterns.
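The quantities defined above translate directly into code. The following is a minimal sketch; seeding the EMA recursion with the first price is an assumption (the text does not state the initial value), and the function names are hypothetical.

```python
import math


def ema(prices, n):
    """n-day exponential moving average, EMA_n(t) = EMA_n(t-1) + a * (p_t - EMA_n(t-1))."""
    a = 2.0 / (n + 1)
    smoothed = [prices[0]]  # assumption: the recursion starts at the first price
    for price in prices[1:]:
        smoothed.append(smoothed[-1] + a * (price - smoothed[-1]))
    return smoothed


def rdp(prices, n, t):
    """Lagged relative difference in percentage, 100 * (p_t - p_{t-n}) / p_{t-n}."""
    if t - n < 0:
        raise IndexError("not enough history for the requested lag")
    return 100.0 * (prices[t] - prices[t - n]) / prices[t - n]


def ema_detrended(prices, n):
    """Transformed closing price: closing price minus its n-day EMA."""
    return [price - avg for price, avg in zip(prices, ema(prices, n))]


def rmse(desired, predicted):
    """Root mean squared error between desired and predicted outputs."""
    if len(desired) != len(predicted) or not desired:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(y - y_hat) ** 2 for y, y_hat in zip(desired, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Illustrative prices only, not data from the paper.
prices = [100.0, 102.0, 101.0, 105.0]
smoothed = ema(prices, 3)
change = rdp(prices, 1, 1)
```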
Three parameters have to be determined in advance for SKSVR with an RBF kernel, namely, C, ε, and γ. The forecasting performance of SKSVR is examined with C = 1 and ε = 0.005. Because the forecasting performance of SKSVR is affected by the parameter γ, we try different settings of it from 0.01 to 3 with a step of 0.05. Figure 2 shows the RMSE of SKSVR on the three data sets. The figure shows that SKSVR requires different γ settings for different data sets to obtain the best performance. For example, the best performance for data1 occurs when 0.35 ≤ γ ≤ 0.45. The best RMSE values obtained by SKSVR are listed in Table 2.
For the ℓp-norm MK-SVR training model, we adopt the RBF kernel $K(x_i, x_j) = \exp\left(-\|x_i - x_j\|_2^2/\sigma^2\right)$. A kernel combining 60 different RBF kernels is considered, that is, 0.01 ≤ 1/σ² ≤ 3 with step 0.05. Hence, the kernel matrix is a weighted sum of 60 kernel matrices, that is, $K = d_1 K_1 + d_2 K_2 + \cdots + d_{60} K_{60}$, where $d_1$ denotes the kernel weight for the first kernel matrix with 1/σ² = 0.01, $d_2$ denotes the kernel weight for the second kernel matrix with 1/σ² = 0.06, and so on. For the three data sets, the RMSE values obtained by ℓp-norm MK-SVR are also listed in Table 2. With p = 1.05, 1.001, and 1.15, the ℓp-norm MK-SVR model performs better than SKSVR on the data1, data2, and data3 data sets, respectively.

Table 3: The data sets for the second experiment.
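The bank of 60 RBF kernels described above can be generated as in the following sketch. It assumes the inputs are given as lists of floats and writes the kernel in terms of `gamma`, used here as a hypothetical name for 1/σ²; the function names are also hypothetical.

```python
import math


def gamma_grid(start=0.01, stop=3.0, step=0.05):
    """Values of 1/sigma^2 from start up to stop (inclusive if reached) in equal steps."""
    count = int(math.floor((stop - start) / step + 1e-9)) + 1
    return [round(start + i * step, 10) for i in range(count)]


def rbf_gram(X, gamma):
    """Gram matrix of K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)."""
    n = len(X)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
            K[i][j] = K[j][i] = math.exp(-gamma * sq_dist)
    return K


# Illustrative inputs only; each row would be a five-dimensional pattern x_t.
X = [[0.0, 0.1, -0.2, 0.3, 0.0], [0.5, -0.1, 0.2, 0.1, 0.4]]
grid = gamma_grid()
base_kernels = [rbf_gram(X, g) for g in grid]
```

With the default arguments the grid starts at 0.01, continues with 0.06, and contains 60 values, matching the kernel count stated in the text.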

Experiment II.
Secondly, we compare the performance of ℓp-norm MK-SVR with that of ℓ1-norm MK-SVR. In this experiment, the daily closing prices of the Shanghai Stock Index in China for the period from January 2008 to December 2011 are used, and the training/validating/testing data sets are generated by a one-season moving-window testing approach. Following Tay and Cao [29], three data sets, D-I to D-III, are formed. The corresponding time periods for D-I to D-III are listed in Table 3. We again adopt RMSE (13) for performance comparison. For the ℓ1-norm MK-SVR and ℓp-norm MK-SVR training models, a kernel combining 40 different RBF kernels is considered, that is, 1/σ² ∈ {0.01, 0.02, . . . , 0.09, 0.1, 0.2, . . . , 0.9, 1, 2, . . . , 9, 10, 20, . . . , 100, 200, 300, 400}. Hence, the kernel matrix is a weighted sum of 40 kernel matrices, that is, $K = d_1 K_1 + d_2 K_2 + \cdots + d_{40} K_{40}$, where $d_1$ denotes the kernel weight for the first kernel matrix with 1/σ² = 0.01, $d_2$ denotes the kernel weight for the second kernel matrix with 1/σ² = 0.02, and so on. For the three data sets, the RMSE values obtained by ℓ1-norm MK-SVR and ℓp-norm MK-SVR are listed in Table 4. With p = 6/5, 4/3, and 8/7, the ℓp-norm MK-SVR model performs better than the ℓ1-norm MK-SVR model on the D-I, D-II, and D-III data sets, respectively. Figure 3 shows the forecasting results for D-I and D-II by the two regression models.
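The 40-value grid of 1/σ² used in this experiment is roughly logarithmic, and can be enumerated as in this short sketch (the function name is hypothetical):

```python
def experiment2_grid():
    """Enumerate the 40 values of 1/sigma^2 listed for the second experiment."""
    grid = []
    for scale in (0.01, 0.1, 1.0):
        # 0.01..0.09, 0.1..0.9, 1..9 (nine values per decade)
        grid.extend(round(m * scale, 10) for m in range(1, 10))
    grid.extend(float(v) for v in range(10, 101, 10))  # 10, 20, ..., 100
    grid.extend([200.0, 300.0, 400.0])
    return grid


grid = experiment2_grid()
```

Compared with the evenly spaced grid of the first experiment, this grid spans more than four orders of magnitude with fewer kernels.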
Furthermore, we use the statistical test proposed by Diebold and Mariano [30] to assess the statistical significance of the forecasts by the ℓp-norm MK-SVR model. The loss-differential series of ℓ1-norm MK-SVR and ℓp-norm MK-SVR are shown in Figures 4 and 5. According to [30], we adopt the asymptotic test statistic

\[
S_1 = \frac{\bar{d}}{\sqrt{2\pi \hat{f}_d(0)/T}},
\]

where $d_i = r_{1i}^2 - r_{2i}^2$ is the loss-differential series built from the forecast errors $r_{1i}$ and $r_{2i}$ of the ℓ1-norm MK-SVR and ℓp-norm MK-SVR models, $\bar{d}$ is its sample mean over the $T$ forecasts, $\hat{f}_d(0)$ is an estimate of the spectral density of the loss differential at frequency zero, and $1(\tau/S(T))$ is the lag window, defined as

\[
1\!\left(\frac{\tau}{S(T)}\right) =
\begin{cases}
1, & \left|\tau/S(T)\right| \le 1,\\
0, & \text{otherwise},
\end{cases}
\]

where $S(T) = k - 1$ and $k$ is the number of forecasting steps ahead. We denote by $U_1$ the forecasting accuracy of ℓ1-norm MK-SVR and by $U_p$ the forecasting accuracy of ℓp-norm MK-SVR. Under the null hypothesis $U_1 = U_p$, the test was performed at the 0.05 and 0.10 significance levels [12]. The test results are shown in Table 5.

We briefly mention that the superior performance of the ℓp-norm MK-SVR model (p > 1) is not surprising. When the sparsity-inducing norm (p = 1) is used, some of the kernel weights are forced to zero, and the corresponding kernels are eliminated, which leads to some information loss. The daily stock closing prices do not carry large parts of overlapping information, and the information is discriminative. A nonsparse kernel mixture can therefore access more information and perform more robustly.
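The Diebold-Mariano statistic described above can be computed as in the following sketch. It assumes squared-error loss and the usual construction in which the long-run variance estimate is the sum of the sample autocovariances of the loss differential up to lag k − 1 (the rectangular lag window); the two-sided p-value from the standard normal distribution and the function name are additions of this sketch.

```python
import math


def diebold_mariano(errors_1, errors_2, k=1):
    """Return (S1, two-sided p-value) for squared-error loss differentials.

    errors_1, errors_2: forecast errors of the two models on the same T points.
    k: number of forecasting steps ahead; autocovariances up to lag k - 1 are used.
    """
    T = len(errors_1)
    if T != len(errors_2) or T < 2:
        raise ValueError("need two error series of equal length >= 2")
    d = [e1 ** 2 - e2 ** 2 for e1, e2 in zip(errors_1, errors_2)]
    d_bar = sum(d) / T

    def autocov(tau):
        return sum((d[t] - d_bar) * (d[t - tau] - d_bar) for t in range(tau, T)) / T

    # Estimate of 2*pi*f_d(0): lag 0 plus twice the lags 1..k-1.
    long_run_var = autocov(0) + 2.0 * sum(autocov(tau) for tau in range(1, k))
    if long_run_var <= 0.0:
        raise ValueError("non-positive variance estimate; statistic undefined")
    s1 = d_bar / math.sqrt(long_run_var / T)
    p_value = math.erfc(abs(s1) / math.sqrt(2.0))  # two-sided, standard normal
    return s1, p_value


# Illustrative errors only, not results from the paper.
s1, p_value = diebold_mariano([1.0, 2.0, 1.0, 2.0], [0.0, 0.0, 0.0, 0.0], k=1)
```

A positive statistic indicates that the first model has the larger average squared error, so in the setting of the text a significantly positive value would favor the second model.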

Summary and Prospect
In this paper, an ℓp-norm MK-SVR model for stock market price forecasting is proposed. The model is trained with an interleaved optimization scheme that alternates between an analytical update of the kernel weights and a chunking-based SVR step. In an empirical evaluation, we show that ℓp-norm MK-SVR can improve predictive accuracy on relevant real-world data sets. Although we focus on stock price forecasting in this paper, our ℓp-norm MK-SVR model could be applied to more general financial forecasting problems. In the future we will therefore apply our ℓp-norm MK-SVR model to other financial markets, such as exchange markets.