Fast Prediction with Sparse Multikernel LS-SVR Using Multiple Relevant Time Series and Its Application in Avionics System

Health trend prediction is critical to ensuring the safe operation of highly reliable systems. However, complex systems often exhibit complex dynamic behaviors and uncertainty, which makes it difficult to develop a precise physical prediction model. Therefore, time series are often used for prediction in this case. In this paper, in order to obtain better prediction accuracy in shorter computation time, we propose a new scheme which utilizes multiple relevant time series to enhance the completeness of the information and adopts a prediction model based on least squares support vector regression (LS-SVR) to perform prediction. In the scheme, we apply two innovative ways to overcome the drawbacks of the reported approaches. One is to remove certain support vectors by measuring the linear correlation to increase the sparseness of LS-SVR; the other is to determine the linear combination weights of multiple kernels by calculating the root mean squared error of each basis kernel. The results of prediction experiments indicate preliminarily that the proposed method achieves good prediction accuracy with low computation time, which makes it valuable in applications.


Introduction
Over the past decade, many diagnosis methodologies have been applied successfully in the domain of the reliability and safety of systems [1-7]. With the development of prognosis and health management (PHM), health condition diagnosis and prognosis technology has become the key technique for enhancing the reliability and safety of complex systems, especially high reliability systems. However, these complex systems often exhibit multiple dynamic evolution behaviors, multilevel structures, and incomplete or uncertain information. Therefore, it is difficult to establish an accurate physical model, and time series analysis methods are often utilized in practice to analyze and predict their health conditions [7-10].
Numerous methods have been reported to solve these problems, such as artificial neural networks (ANN) and support vector regression (SVR) [5, 6, 8, 10-14]. ANN has been used in several fields due to its universal approximation property. However, it may fall into local minimum traps, and it is difficult to determine the hidden layer size and the learning rate [15-17]. On the other hand, SVR aims at the global optimum and exhibits better prediction accuracy due to its implementation of the structural risk minimization principle [18, 19].
SVR is an effective methodology for handling function estimation problems [13, 14, 20]. SVR training is formulated as a convex quadratic programming (QP) problem, which has a unique solution. The solution of the QP problem provides the necessary information for choosing the most important data points, known as support vectors, among all the sample data. Based on the SVR formulation, the support vectors uniquely define the estimated regression function.
SVR was initially developed for performing linear regression. It has been extended to nonlinear regression applications by the technique called the kernel trick [18-20]. Firstly, a mapping $\varphi(x)$ associated with a kernel function is used to map the input data $x$ into an arbitrarily high dimensional feature space. Secondly, the linear SVR is applied to create an approximate linear function in this high dimensional feature space. This way, nonlinear regression in the low dimensional input space corresponds to linear regression in the high dimensional feature space [21].
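As a small illustration of the kernel trick described above, the following sketch checks numerically that a kernel evaluation in the input space equals an inner product in an explicit feature space. The degree-2 polynomial kernel and the names `poly2_kernel` and `poly2_features` are illustrative assumptions and are not taken from the paper, which uses linear and Gaussian RBF kernels.

```python
import math


def dot(u, v):
    """Plain inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel K(x, z) = (x . z)^2."""
    return dot(x, z) ** 2


def poly2_features(x):
    """Explicit feature map for 2-D inputs: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]


x, z = [1.0, 2.0], [3.0, -1.0]
k_implicit = poly2_kernel(x, z)                           # computed in input space
k_explicit = dot(poly2_features(x), poly2_features(z))    # computed in feature space
```

Both quantities agree up to rounding, so a linear method that only needs inner products can work in the feature space without ever forming the mapped vectors.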
The complexity of SVR depends not only on the input space dimension but also on the number of sample data points. Therefore, when the sample data size is large, the quadratic programming (QP) problem becomes more complex and costs a lot of computation time. For this reason, least squares support vector regression (LS-SVR) was proposed by Suykens et al. [22]. In LS-SVR, the inequality constraints are replaced by equality constraints. This way, solving a QP is converted into solving a set of linear equations, which directly reduces the training computational complexity. Thus, the training time of LS-SVR is reduced greatly.
Simulations and applications have demonstrated that LS-SVR yields better prediction results in most cases [23-26]. Moreover, in order to obtain abundant information on the physical system, researchers proposed multiple relevant time series prediction methods, which achieve better prediction results than single time series prediction [27-29]. In the practice of fast health trend prediction for high reliability systems via LS-SVR with multiple relevant time series, we also need to overcome some serious drawbacks of the reported methods, as described below.
(1) Because all the training sample data are selected as support vectors, the fitting of LS-SVR contains redundant information and lacks sparseness [30]. This drawback may influence the generalization capability of LS-SVR as well as the training complexity, which will increase the prediction time, especially for multiple relevant time series.
(2) The information contained in the sample data is very often incomplete and complex. Thus, LS-SVR with a single kernel may not be sufficient to represent the information contained in the data, and it cannot solve complex problems satisfactorily, especially for prediction with multiple relevant time series sample data. Moreover, how to select appropriate kernel functions remains unsolved in theory.
Aiming at fast prediction of multiple relevant time series with LS-SVR, we develop a new scheme to overcome these drawbacks and obtain better prediction performance, that is, accurate prediction results and fast prediction simultaneously. In this paper, combining the merits of LS-SVR with a single kernel for multiple relevant time series [27] and LS-SVR with multikernels for a single time series [31], we propose a new fast prediction method to improve the prediction performance with multiple relevant time series and multiple basis kernels. This method provides a regression scheme which integrates multiple kernel learning and enforces sparseness on the training sample data. It includes two aspects.
(1) We apply a simple computational method to determine the combination weights of multiple basis kernels by calculating the root mean squared error (RMSE) of prediction with each basis kernel.
(2) We propose to prune the support vectors by judging the linear correlation among the samples in the high dimensional feature space. This scheme allows the information included in the sample data to be amply represented, reduces the number of support vectors, enhances the generalization performance, and shortens the prediction time with better or acceptable prediction accuracy.
The remainder of the paper is organized as follows. Section 2 gives a brief introduction of LS-SVR. Section 3 proposes the scheme in detail, which includes pruning support vectors of LS-SVR and computing the combination coefficients of the multikernel, and then presents the fast prediction steps. Section 4 shows three experiments and the analysis of their results, and conclusions are given in Section 5.

A Brief Introduction of LS-SVR
Given the labeled data set $D = \{(x_i, y_i),\ i = 1, 2, \ldots, N\}$, where $x_i \in R^d$ is the input sample and $y_i \in R$ is the corresponding output label, the main idea of regression based on support vector machine theory is to map the input data to a higher dimensional feature space by means of a nonlinear mapping $\varphi : R^d \to F$ and solve the linear regression model shown as follows:
$$\hat{y} = f(x) = w^T \varphi(x) + b, \tag{1}$$
where $\hat{y}$ is the estimation of $y$, $f$ is the regression function, $w$ is the coefficient vector, and $b$ is the bias term. The rationale behind support vector regression (SVR) is to find $w$ and $b$ that optimize the generalization ability of the regressor by minimizing the regularized loss function. SVR is also a convex quadratic programming (QP) problem, and it has a unique optimal solution, but the computation process is complex. So, many researchers proposed modifications of SVR.
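The LS-SVR method is a variant of SVR [11]. Compared with SVR, LS-SVR has the following characteristics: the error variable $e_i$ is used to control deviations from the regression function, and a squared loss function is used instead of the $\varepsilon$-insensitive loss function. According to the approach of LS-SVR, the model estimate is given by the following optimization problem:
$$\min_{w, b, e} J(w, e) = \frac{1}{2} w^T w + \frac{\gamma}{2} \sum_{i=1}^{N} e_i^2 \quad \text{s.t. } y_i = w^T \varphi(x_i) + b + e_i,\ i = 1, \ldots, N. \tag{2}$$
The above formulation is nothing but a regression cost function formulated in the feature space defined by the mapping function $\varphi(x)$. Parameter $\gamma$ determines the tradeoff between the model complexity and the goodness of fit to the sample data. In LS-SVR [22], the Lagrangian function is presented as follows:
$$L(w, b, e; \alpha) = J(w, e) - \sum_{i=1}^{N} \alpha_i \left[ w^T \varphi(x_i) + b + e_i - y_i \right], \tag{3}$$
where $\alpha_i$ is the $i$th Lagrange multiplier. Because (2) is a convex problem, the Slater constraint qualification holds [32]. Therefore, the optimal solution of (2) satisfies the KKT conditions. The optimality conditions are shown as follows:
$$\frac{\partial L}{\partial w} = 0 \Rightarrow w = \sum_{i=1}^{N} \alpha_i \varphi(x_i), \quad \frac{\partial L}{\partial b} = 0 \Rightarrow \sum_{i=1}^{N} \alpha_i = 0, \quad \frac{\partial L}{\partial e_i} = 0 \Rightarrow \alpha_i = \gamma e_i, \quad \frac{\partial L}{\partial \alpha_i} = 0 \Rightarrow w^T \varphi(x_i) + b + e_i - y_i = 0. \tag{4}$$
After eliminating $e_i$ and $w$ from (4), the KKT conditions can be expressed as
$$\begin{bmatrix} 0 & 1_N^T \\ 1_N & K + I/\gamma \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix}, \tag{5}$$
where $K$ is the kernel matrix with entries $K_{ij} = \varphi(x_i)^T \varphi(x_j)$, $1_N$ is an $N$-dimensional vector of all ones, $I$ is an identity matrix, and $y = [y_1, y_2, \ldots, y_N]^T$. Equation (5) can be factorized into a positive definite system [33].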
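Let $H = K + I/\gamma$, and we get the following equations from (5):
$$1_N^T \alpha = 0, \tag{6}$$
$$1_N b + H \alpha = y. \tag{7}$$
Multiplying both sides of (7) by $1_N^T H^{-1}$, we get
$$1_N^T H^{-1} 1_N b + 1_N^T \alpha = 1_N^T H^{-1} y. \tag{8}$$
According to (6), (7), and (8), we have
$$1_N^T H^{-1} 1_N b = 1_N^T H^{-1} y. \tag{9}$$
The Lagrange dual variables $\alpha$ and the bias term $b$ are then obtained solely by
$$b = \frac{1_N^T H^{-1} y}{1_N^T H^{-1} 1_N}, \quad \alpha = H^{-1} (y - 1_N b). \tag{10}$$
Any unlabeled input $x$ can subsequently be estimated by the following function:
$$f(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i) + b. \tag{11}$$
For multiple relevant time series, we can consider all sample data at the same time point as one multidimensional vector, which means that the inputs become multidimensional vectors.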
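The closed-form training step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' Matlab implementation: it uses the standard LS-SVR solution $b = 1^T H^{-1} y / (1^T H^{-1} 1)$ and $\alpha = H^{-1}(y - 1b)$ with $H = K + I/\gamma$, a hand-written Gaussian elimination instead of a numerical library, and hypothetical names (`solve`, `lssvr_fit`, `lssvr_predict`) and toy settings (20 points, `gamma=100`).

```python
import math


def rbf_kernel(x, z, sigma2=1.0):
    """Gaussian RBF kernel exp(-||x - z||^2 / (2 * sigma2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-d2 / (2.0 * sigma2))


def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(row) + [r] for row, r in zip(A, rhs)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def lssvr_fit(X, y, gamma, kernel):
    """Return (alpha, b) for H = K + I/gamma using the closed-form solution."""
    n = len(X)
    H = [[kernel(X[i], X[j]) + (1.0 / gamma if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    h_inv_y = solve(H, y)              # H^{-1} y
    h_inv_1 = solve(H, [1.0] * n)      # H^{-1} 1
    b = sum(h_inv_y) / sum(h_inv_1)    # 1^T H^{-1} y / (1^T H^{-1} 1)
    alpha = [u - b * v for u, v in zip(h_inv_y, h_inv_1)]  # H^{-1} (y - 1 b)
    return alpha, b


def lssvr_predict(x, X, alpha, b, kernel):
    """Evaluate f(x) = sum_i alpha_i K(x, x_i) + b."""
    return sum(a * kernel(x, xi) for a, xi in zip(alpha, X)) + b


# Toy data (an assumption for illustration): samples of sin(x)/x on (0, 2].
X = [[0.1 * i] for i in range(1, 21)]
y = [math.sin(x[0]) / x[0] for x in X]
alpha, b = lssvr_fit(X, y, gamma=100.0, kernel=rbf_kernel)
y_hat = lssvr_predict([1.05], X, alpha, b, rbf_kernel)
```

Only two linear solves with the same matrix $H$ are needed, which is the reason LS-SVR trains faster than a QP-based SVR; in practice a Cholesky factorization of $H$ would be reused for both right-hand sides.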

Proposed Fast Prediction Method for Multiple Relevant Time Series
In order to obtain better prediction performance, we use multiple relevant time series and multiple kernel learning (MKL) to fully utilize the information contained in the data. In this section, we propose an improved scheme with multikernel LS-SVR to accomplish fast prediction. This scheme utilizes new approaches to compute the combination coefficients of the multikernel and to decrease the number of support vectors. General kernel methods use a single kernel function and choose the same corresponding parameter for the whole sample data set. However, the distribution of the sample data differs from one mapping space to another. For this reason, MKL was proposed by Lanckriet et al. [34].

Combination Coefficients of Multiple Kernels
MKL is an active research topic in the field of machine learning. It provides a more flexible framework than a single kernel and mines the information in the data more adaptively and effectively, especially in improving the performance of the regression function.
In the MKL framework, a combined kernel function is defined as the weighted sum of the individual basis kernels. MKL aims to optimize the kernel weights while training SVR-based methods, such as LS-SVR with multikernel [28, 31, 35]. Though researchers have proposed a variety of methods of integrating multiple kernels from three aspects, which are the composite kernels, the multiscale kernels, and the infinite kernels [35], the linear convex combination of basis kernels is still one of the most frequently used methods. For this method, each basis kernel can exploit the full set of features or just use a subset of features. In this paper, using the equations provided by Sonnenburg et al. [36], we consider the combined kernel expressed as follows:
$$K(x_i, x_k) = \sum_{j=1}^{M} \mu_j K_j(x_i, x_k), \tag{12}$$
where $M$ is the number of basis kernels and $\mu_j$ is the combining weight for the $j$th basis kernel. According to the properties of kernel functions [37], the matrix $K$ is symmetric positive semidefinite; that is, $K \succeq 0$. We need to normalize all kernel matrices $K_j$ by replacing $K_j(x_i, x_k)$ with the following expression to get unit diagonal matrices:
$$\frac{K_j(x_i, x_k)}{\sqrt{K_j(x_i, x_i)\, K_j(x_k, x_k)}}. \tag{13}$$
Based on MKL, the LS-SVR optimization problem (14) is derived from (2)-(4) with the new $K$ matrix. We then transform the new kernel into the matrix form $K = \sum_{j=1}^{M} \mu_j K_j$. When the weights $\mu_j$ are constrained to be nonnegative and the $K_j$ are positive semidefinite, the constraint $\sum_{j=1}^{M} \mu_j K_j \succeq 0$ is satisfied automatically. In this case, with respect to the corresponding parameters and motivated by Lanckriet et al. [34] and Ye et al. [38], the solution of (14) can be presented in the form (15). Using the Schur complement lemma [32, 39], (15) can be cast into the form of a semidefinite program (SDP), and then we reduce the SDP under strict constraints to a quadratically constrained quadratic program (QCQP). The objective function of the QCQP is convex in $\mu$ and $\alpha$. In other words, the minimization problem is strictly feasible in $\mu$, and the maximization problem is feasible in $\alpha$ [34, 40]. Such a QCQP problem can be solved efficiently by interior point methods. The obtained dual variables can be used to fix the optimal kernel coefficients.
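The weighted sum of normalized basis kernel matrices described above can be sketched as follows. This is an illustration under assumptions: the three basis kernels mirror the kinds used later in the experiments (one linear, two Gaussian RBF), while the sample points, the weights, and the function names `gram`, `normalize`, and `combine` are hypothetical.

```python
import math


def linear_kernel(x, z):
    """Linear kernel x^T z."""
    return sum(a * b for a, b in zip(x, z))


def make_rbf(sigma2):
    """Return a Gaussian RBF kernel with the given sigma^2."""
    def rbf(x, z):
        d2 = sum((a - b) ** 2 for a, b in zip(x, z))
        return math.exp(-d2 / (2.0 * sigma2))
    return rbf


def gram(kernel, X):
    """Kernel (Gram) matrix of the samples in X."""
    return [[kernel(a, b) for b in X] for a in X]


def normalize(K):
    """Scale K_ik by 1/sqrt(K_ii * K_kk) so that the diagonal becomes 1."""
    d = [math.sqrt(K[i][i]) for i in range(len(K))]
    n = len(K)
    return [[K[i][k] / (d[i] * d[k]) for k in range(n)] for i in range(n)]


def combine(mats, weights):
    """Weighted sum of equally sized kernel matrices."""
    n = len(mats[0])
    return [[sum(w * M[i][k] for w, M in zip(weights, mats))
             for k in range(n)] for i in range(n)]


# Hypothetical samples and weights, chosen only for illustration.
X = [[1.0, 2.0], [2.0, 0.5], [3.0, 1.0]]
mats = [normalize(gram(k, X)) for k in (linear_kernel, make_rbf(5.0), make_rbf(9.0))]
weights = [0.2, 0.5, 0.3]
K = combine(mats, weights)
```

Because every normalized matrix has a unit diagonal and the weights sum to one, the combined matrix also has a unit diagonal; nonnegative weights keep it positive semidefinite.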
For MKL with a linear convex combination of basis kernels, the key is to obtain the optimal combining weights $\mu_j$. Some researchers proposed to simultaneously optimize both the combining weights $\mu_j$ and the parameters of LS-SVR, for example, the regularization parameter $\gamma$ and the parameters of the basis kernels. Now, the more common way is to adopt optimization software packages, such as MOSEK [41], which can solve the primal and dual problems simultaneously using interior point methods. But all these solution methods require much computation time and are complex for applications in practice.
Because the multiple kernels are combined linearly in this paper, and taking realistic applications into full account, we propose a simple approach to determine the combination weights of the new kernel by calculating the root mean squared error (RMSE) of prediction using each single basis kernel; that is, a smaller RMSE value results in a bigger weight value.
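A smaller prediction error means that the LS-SVR model with the corresponding basis kernel is a better model. Thus, the RMSE of prediction is used as the evaluation criterion. The RMSE of the multiple variables prediction is defined as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{lN} \sum_{k=1}^{l} \sum_{t=1}^{N} \left( y_k(t) - \hat{y}_k(t) \right)^2}, \tag{16}$$
where $l$ is the number of relevant variables, $N$ is the number of the original training sample data points, and $y_k(t)$ and $\hat{y}_k(t)$ are the actual and the predicted values, respectively. The combination weights $\mu_j$ of the multiple kernels are computed as follows:
$$\mu_j = \frac{\sum_{i=1}^{M} \mathrm{RMSE}_i - \mathrm{RMSE}_j}{(M - 1) \sum_{i=1}^{M} \mathrm{RMSE}_i}, \tag{17}$$
where $\mathrm{RMSE}_j$ is the prediction RMSE of the $j$th kernel, $\sum_{i=1}^{M} \mathrm{RMSE}_i$ is the sum of the RMSEs of all basis kernels, and $\sum_{i=1}^{M} \mathrm{RMSE}_i - \mathrm{RMSE}_j$ represents the contribution of the $j$th kernel.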
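The RMSE-based weighting can be illustrated with a short sketch. Two points are assumptions rather than statements from the paper: the pooled RMSE averages squared errors over all variables and time points, and the weights are normalized by $(M-1)\sum_i \mathrm{RMSE}_i$ so that they sum to one, which is what a convex combination requires. The function names and the example RMSE values are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Pooled RMSE over several series (one list per relevant variable)."""
    total, count = 0.0, 0
    for ya, yp in zip(actual, predicted):
        for a, p in zip(ya, yp):
            total += (a - p) ** 2
            count += 1
    return math.sqrt(total / count)


def kernel_weights(rmses):
    """Weight of kernel j is proportional to (sum of all RMSEs - RMSE_j).

    Assumption: dividing by (M - 1) * sum makes the weights sum to one.
    """
    m = len(rmses)
    total = sum(rmses)
    return [(total - r) / ((m - 1) * total) for r in rmses]


# Hypothetical RMSEs of three basis kernels on the training data.
weights = kernel_weights([0.10, 0.20, 0.30])
example_rmse = rmse([[1.0, 2.0], [3.0, 4.0]], [[1.0, 2.0], [3.0, 6.0]])
```

The kernel with the smallest RMSE receives the largest weight, and no optimization problem has to be solved, which is the source of the time saving compared with SDP or QCQP based weight selection.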

Linear Correlation-Based Method of Pruning Support Vectors
A major drawback of LS-SVR is its lack of sparseness [42], because almost every sample data point becomes a support vector. In order to get a sparse solution, several methods have been proposed. For example, Suykens et al. [43] proposed a pruning scheme based on the support vector spectrum that prunes those support vectors with smaller Lagrange multiplier values. de Kruif and de Vries [42] introduced a pruning procedure based on the smallest approximation error incurred when a support vector is omitted. Hoegaerts et al. [44] improved the method proposed by de Kruif and de Vries. They suggested a variant that improved the performance significantly and assessed its relative performance compared with two other subset selection schemes. Keerthi and Shevade [45] extended the well-known SMO algorithm to LS-SVM to solve problems with large training samples. Based on this study, Zeng and Chen [46] proposed a new pruning algorithm for sparse LS-SVR via the SMO algorithm. Jiao et al. [47] presented two fast sparse approximation schemes for LS-SVR.
The basic idea of many of the approaches reviewed above is to find the support vector with the smallest indicator value and remove the corresponding training sample. The idea is simple, but all the reported schemes require the linear equations to be solved repeatedly [48, 49]. For this reason, these schemes are not suitable for fast prediction. Thus, Yaakov et al. [50] and Cawley and Talbot [51] presented a more efficient scheme that evaluates the linear correlation among the samples in the mapping space. The merits of [48-51] are that the linear correlation-based sparseness strategy retains as much of the information in the data as possible, and that the sparse prediction model has a better generalization capability, which improves the model's adaptability to noisy sample data. So, the proposed method is based on [48-51] and uses the following idea.
Suppose that $\varphi(x_i)$ is one of the $N$ support vectors in the high dimensional feature space. The vectors $\varphi(x_i)$ are linearly independent if and only if the equation
$$\sum_{i} a_i \varphi(x_i) = 0 \tag{18}$$
holds only for $a_i = 0$. Here, such an $x_i$ is called a base vector (BV), and the set consisting of the $x_i$ is called the base vector set (BVS). In order to obtain sparseness in LS-SVR, any support vector linearly related to the BVs can be deleted. Assume that there are $N$ support vectors in the sample data and the BVS consists of $m$ support vectors ($m < N$). If the high dimensional mapping $\varphi(x_{m+1})$ of the $(m + 1)$th sample $x_{m+1}$ cannot be represented by a linear combination of the $m$ support vectors, it will be added into the BVS; otherwise, it does not join the BVS.
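In many real cases, the new sample data may not be exactly linearly dependent on the existing support vectors, yet it is very close to a linear combination of them. That is, for sample $x_t$, we will be content with finding coefficients $a_{t,i}$, with at least one nonzero element, satisfying the approximate linear dependence condition
$$\left\| \sum_{i=1}^{m} a_{t,i} \varphi(x_i) - \varphi(x_t) \right\|^2 \le \nu, \tag{19}$$
where $\nu$ is a small positive constant. Then, following [48, 50] and expanding the squared norm with the kernel function, we have
$$a_t^T \tilde{K} a_t - 2 a_t^T \tilde{k}_t + k_{tt} \le \nu, \tag{20}$$
where $\tilde{K}$ is the kernel matrix of the base vectors with entries $K(x_i, x_j)$, $\tilde{k}_t$ is the vector with entries $K(x_i, x_t)$, and $k_{tt} = K(x_t, x_t)$. By minimizing the left-hand side of (20), we can obtain the linear correlation coefficients $a_t = [a_{t,1}, a_{t,2}, \ldots, a_{t,m}]^T$ simultaneously.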
To trade the optimality of $a_t$ for a reduction in the size of the sample data, an $\ell_2$ norm regularization term of the form $\rho \|a_t\|^2$ is added to the minimization problem defined in (20):
$$\min_{a_t} \left\{ \left\| \sum_{i=1}^{m} a_{t,i} \varphi(x_i) - \varphi(x_t) \right\|^2 + \rho \|a_t\|^2 \right\}, \tag{21}$$
where $\rho$ is a positive constant coefficient. The sparseness level and the model accuracy are controlled by the parameters $\nu$ and $\rho$, but no rules have been reported in the literature for the selection of these two parameters.
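The approximate linear dependence test with an $\ell_2$ regularization term can be sketched as below. This follows the general scheme attributed to [50] in the text, not the selection rule proposed in this paper: the threshold `nu` and the ridge term `rho` are fixed by hand, the base vector set is seeded with the first sample instead of the samples with the largest Lagrange multipliers, and all function names are hypothetical. With kernel matrix $\tilde{K}$ of the base vectors and $\tilde{k}_t$ the kernel vector of the candidate, the coefficients solve $(\tilde{K} + \rho I) a_t = \tilde{k}_t$.

```python
def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(row) + [r] for row, r in zip(A, rhs)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def linear_kernel(x, z):
    """Linear kernel, used here so that dependence is easy to see."""
    return sum(a * b for a, b in zip(x, z))


def ald_test(bv, x_new, kernel, rho):
    """Return (a_t, delta): ridge coefficients and the squared residual
    || sum_i a_i phi(bv_i) - phi(x_new) ||^2 written with kernels."""
    m = len(bv)
    K = [[kernel(u, v) for v in bv] for u in bv]
    k = [kernel(u, x_new) for u in bv]
    K_reg = [[K[i][j] + (rho if i == j else 0.0) for j in range(m)]
             for i in range(m)]
    a = solve(K_reg, k)
    quad = sum(a[i] * K[i][j] * a[j] for i in range(m) for j in range(m))
    delta = quad - 2.0 * sum(ai * ki for ai, ki in zip(a, k)) + kernel(x_new, x_new)
    return a, delta


def select_base_vectors(X, kernel, nu, rho):
    """Keep a sample only if it is not approximately dependent on the BVS."""
    bv = [X[0]]
    for x in X[1:]:
        _, delta = ald_test(bv, x, kernel, rho)
        if delta > nu:
            bv.append(x)
    return bv


X = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bv = select_base_vectors(X, linear_kernel, nu=1e-6, rho=1e-8)
```

With the linear kernel the feature map is the identity, so the second and fourth samples are combinations of the two retained base vectors and are dropped.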
In this paper, we select $\nu = \|a_t\|^2 / \gamma^2$, and then we propose a new sparseness method, which is described by (23), where $\gamma$ is the same as the LS-SVR parameter $\gamma$. We solve (23), and if the resulting minimum $\delta_t$ is bigger than $\|a_t\|^2 / \gamma^2$, $x_t$ is added into the BVS; otherwise, $x_t$ is dropped. Following the ideas outlined above, we propose the following approach to prune support vectors. A detailed pseudocode account of the proposed method is given in Pseudocode 1.
Based on the pseudocode, we first choose the two sample data points corresponding to the two biggest Lagrange multipliers as the initial BVs. Then, if the new nonlinear mapping $\varphi(x_t)$ depends linearly on the selected support vectors, it can be dropped; that is, the new data point $x_t$ is dropped if it satisfies the relation
$$\varphi(x_t) = \sum_{i=1}^{m} a_{t,i} \varphi(x_i), \tag{24}$$
where $a_{t,i}$ is a constant coefficient and $m$ is the number of independent support vectors. Combining (24) with (11), we have
$$f(x) = \sum_{i=1}^{m} \alpha_i K(x, x_i) + \sum_{t \notin \mathrm{BVS}} \alpha_t \sum_{i=1}^{m} a_{t,i} K(x, x_i) + b, \tag{25}$$
where $x_i$ is the sample corresponding to a BV. Then, we get the new regression function with BVs, which can be described by
$$f(x) = \sum_{i=1}^{m} \beta_i K(x, x_i) + b, \tag{26}$$
where $\beta_i$ combines the Lagrange dual variable coefficients with the linear combination coefficients. The proposed method combines the advantages of removing smaller support vectors with the advantages of [48-51]. It can retain much of the useful information in the sample data. Thus, the prediction accuracy is not sacrificed too much while the model's generalization ability is improved. Although this processing step increases the calculation time for the LS-SVR model, the training time does not increase much, because almost every intermediate parameter value is already computed and stored in the process of setting up the prediction model, and the prediction time is reduced because of the smaller number of support vectors. Moreover, the proposed method avoids a difficult problem in Yaakov's scheme, the parameter selection, for which no rules have been reported. Thus, it may be more valuable in practice.
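To make the step from the full regression function to the BV-based one concrete, the sketch below folds the multipliers of the dropped samples into the coefficients of the base vectors. The explicit rule $\beta_i = \alpha_i + \sum_t \alpha_t a_{t,i}$ is an assumption obtained by substituting the linear dependence relation into the regression function; the text itself only says that the new coefficient combines both quantities. All names and numbers are hypothetical, and the linear kernel is used so that the dependence is exact.

```python
def linear_kernel(x, z):
    """Linear kernel x^T z (feature map is the identity)."""
    return sum(a * b for a, b in zip(x, z))


def fold_coefficients(alpha_bv, alpha_dropped, A):
    """beta_i = alpha_i + sum_t alpha_t * A[t][i], where A[t][i] = a_{t,i}
    expresses dropped sample t as a combination of the base vectors."""
    beta = list(alpha_bv)
    for alpha_t, a_t in zip(alpha_dropped, A):
        for i, a_ti in enumerate(a_t):
            beta[i] += alpha_t * a_ti
    return beta


def predict(x, points, coeffs, b, kernel):
    """Evaluate sum_i coeffs_i * K(x, points_i) + b."""
    return sum(c * kernel(x, p) for c, p in zip(coeffs, points)) + b


# Hypothetical toy model: two base vectors and two dependent samples.
bv = [[1.0, 0.0], [0.0, 1.0]]
dropped = [[2.0, 0.0], [1.0, 1.0]]
A = [[2.0, 0.0], [1.0, 1.0]]          # dropped[t] = sum_i A[t][i] * bv[i]
alpha_bv, alpha_dropped, b = [0.5, -0.25], [0.1, -0.35], 0.2

beta = fold_coefficients(alpha_bv, alpha_dropped, A)
x = [3.0, -2.0]
full = predict(x, bv + dropped, alpha_bv + alpha_dropped, b, linear_kernel)
sparse = predict(x, bv, beta, b, linear_kernel)
```

Both predictions coincide here because the dependence is exact; with approximate dependence the sparse model differs slightly but evaluates only `len(bv)` kernel terms per prediction instead of one per training sample.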

Prediction Steps of the Proposed Method
By applying the scheme proposed above, we can set up the LS-SVR model with multikernel for multiple relevant time series, called the improved MKLS-SVR (IMKLS-SVR), and we will use this new model to carry out multiple relevant time series prediction. Here, we look at the multiple relevant time series as an integral whole; that is, they are taken as a multidimensional input [52, 53], and the prediction is executed simultaneously.
The prediction steps can be described as follows:
(1) choose appropriate basis kernels;
(2) determine the weights $\mu_j$ and then obtain the combination kernel;
(3) compute $K$, $H$, $\alpha$, and $b$;
(4) assess the linear correlation, prune the support vectors, and compute the new BV-based Lagrange multipliers $\beta_i$;
(5) set up the regression function and perform prediction.

Experiments and Result Analysis
In order to examine the prediction efficiency of the proposed method, we provide simulation and application experiments.
All the experiments use Matlab R2011b with the LS-SVMlab1.8 Toolbox (the software and guide book can be downloaded from http://www.esat.kuleuven.be/sista/lssvmlab) under the Windows XP operating system.

Simulation Experiments and Results Analysis
The simulation experiments include two parts. One is to test the proposed pruning method based on linear correlation (Experiment 1), and the other is to test the proposed method of computing the multikernel combination coefficients (Experiment 2). All tests are repeated 50 times with the results averaged. The two experiments are presented below in detail. In order to test the efficiency of the proposed pruning method, we compare it with the method which Yaakov et al. presented in [50]. Here, the kernel is the Gaussian RBF kernel $K(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2)$ with standard deviation $\sigma = 1$. The other two parameters are set to 0.01 and 0.001. In this experiment, the training time, the prediction time, and the mean prediction RMSE are compared. The results are shown in Table 1.
Table 1 shows that the proposed method attains good accuracy with less computation time. In particular, it can reduce the prediction time greatly without losing much prediction accuracy. In addition, this method avoids the problem of parameter selection.

... selected, respectively, as sample time series. We set the first 80 data points as training samples. We again take any 11 consecutive data points of each series as a sample, where the first 10 data points compose the input sample vector and the last one is the output; that is, in this simulation experiment, we have 70 training data points for each variable. Then we predict data points number 81 to number 100 of the time series using the trained model.
In order to test the practicability of the proposed method of computing the multikernel combination coefficients, we choose one linear kernel function $K(x, y) = x^T y$ representing global information and two Gaussian RBF functions ($\sigma^2$ equal to 5 and 9, resp.) representing local information [54], and we establish the multikernel LS-SVR by two approaches, namely, the proposed method and the one described in [31]. After determining the new combination weights of the basis kernels, we compare the training time, the prediction time, and the mean prediction RMSE with those of LS-SVR (with a single Gaussian RBF kernel, $\sigma^2 = 5$) and multikernel LS-SVR. The results are shown in Table 2.
The simulation results in Table 2 show that the proposed method reduces the computation time more than the other two methods, although it does not achieve the best prediction results. The proposed method has good prediction accuracy and requires less prediction time.

Avionics System Experiment and Results Analysis
In this section, we use a certain circuit of an avionics system (shown in Figure 1) to demonstrate the efficiency of the proposed method with multiple relevant time series. The circuit is used to compare the prediction time and the prediction accuracy with those of the traditional LS-SVR model described in [25] and of LS-SVR with multikernel, called MKLS-SVR, shown in [31].
In order to collect the sample data, we measure the voltage values at three measuring points named point A, point B, and point C (marked with pentagrams; see Figure 1). The sampling interval is 0.1 s over a total duration of 4.5 s, which gives 45 data points per measuring point.
We take the first 30 groups of measured data points as the training samples. Again, each sample consists of any 11 consecutive data points; that is, we have 20 groups of training data. Then we predict data points number 31 to number 45 of the time series. Here, we look at the time series obtained from point A, point B, and point C (see Figure 1) as an integral whole; that is, they are treated as a multidimensional input and predicted simultaneously. We run the prediction 100 times and average the results. Moreover, all the parameters of the kernel functions are jointly optimized with the traditional grid search method, where the search range for $\gamma$ and $\sigma^2$ is [0.01, 2000], and the prediction performance is measured by RMSE.
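The construction of training samples from several relevant series can be sketched as a sliding window. The layout is an assumption: the text only states that the series are treated as one multidimensional input, so concatenating the 10-point windows of the three series into one input vector and using the three next values as targets is one plausible reading. The function name `make_samples` and the synthetic series are hypothetical.

```python
def make_samples(series_list, window=10):
    """Build (inputs, targets) from equally long series.

    Each sample uses `window` consecutive points of every series as input
    (concatenated series by series) and the following point of every
    series as target.
    """
    n = len(series_list[0])
    inputs, targets = [], []
    for start in range(n - window):
        x = []
        for s in series_list:
            x.extend(s[start:start + window])
        inputs.append(x)
        targets.append([s[start + window] for s in series_list])
    return inputs, targets


# Synthetic stand-ins for the voltages at points A, B, and C (30 points each).
series = [[float(t + 100 * k) for t in range(30)] for k in range(3)]
inputs, targets = make_samples(series)
```

With 30 points per series and 11 consecutive points per sample, this yields the 20 training groups mentioned in the text; a multi-output LS-SVR can then reuse the same matrix $H$ for each target component.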
The Gaussian RBF kernel is adopted as the kernel function of the traditional LS-SVR. For MKLS-SVR and IMKLS-SVR, we again choose one linear kernel function and two Gaussian RBF functions. The prediction results are shown in Figures 2, 3, and 4 (dimensions omitted), and the mean prediction RMSEs for the three data sets are reported in Table 3.
Figures 2-4 and Table 3 show that (1) compared with the traditional LS-SVR, the proposed method, IMKLS-SVR, has good prediction accuracy, which may be due to the fact that the proposed method uses the relevant information fully and maps it into a multikernel high dimensional space; (2) with almost the same accuracy as MKLS-SVR, IMKLS-SVR requires less prediction time than both MKLS-SVR and LS-SVR, especially LS-SVR, which may be due to the fact that we have reduced the number of support vectors by measuring the linear correlation of the sample data. Thus, IMKLS-SVR is more valuable in application.

Conclusions
In this paper, we study the issue of fast prediction with multikernel LS-SVR for multiple relevant time series. We utilize two approaches to overcome the drawbacks of LS-SVR and to meet the requirements of good prediction accuracy and low prediction time cost. Firstly, because each time series data point may have different distribution characteristics, we apply MKL, that is, a linear combination of basis kernels, to enhance the generalization capability of the learning machine and to exploit all discriminative information in the sample data. We determine the linear combination weights by calculating the RMSE of each basis kernel to yield the new kernel. Secondly, we remove samples by judging their linear correlation to reduce the number of support vectors. This approach can control the loss of useful information in the sample data, reduce the computation time, and improve the model's generalization ability simultaneously. We conducted several experiments to evaluate the proposed method; in particular, we used a prediction application to demonstrate its effectiveness by comparing it with the traditional LS-SVR and multikernel LS-SVR. The results show that the proposed prediction scheme has good prediction performance and is suitable for fast prediction of multiple relevant time series.
The kernel function and the corresponding kernel parameters are the key issues affecting the model prediction accuracy. An effective kernel function should represent sample data adaptively.

Experiment 1. We first use LS-SVR to learn the 1-dimensional function $y = \sin(x)/x$ defined on the interval $x \in [0.1, 10]$. We collect 100 values of $y$ as our time series. The data points from 1 to 60 in the time series are used to form the 50 initial training samples. The first sample consists of points 1 through 11, with the first 10 as the input sample vector and the 11th point as the output. The second sample consists of points 2 through 12, with points 2 through 11 as the input sample vector and the 12th point as the output. This way, we obtain 50 training samples out of the first 60 data points.

Figure 3: Prediction results of data collected from point B (see Figure 1).

Figure 4: Prediction results of data collected from point C (see Figure 1).

Table 1: Prediction results of simulation Experiment 1.

Table 2: Prediction results of simulation Experiment 2.

Table 3: Prediction results of application experiment.