Reconstruct the Support Vectors to Improve LSSVM Sparseness for Mill Load Prediction

The sparse strategy plays a significant role in the application of the least square support vector machine (LSSVM), because it alleviates the lack of sparseness and robustness in the LSSVM solution. In this paper, a sparse method using reconstructed support vectors is proposed and successfully applied to mill load prediction. Different from other sparse algorithms, it no longer selects the support vectors from the training data set according to their ranked contributions to the LSSVM optimization. Instead, the reconstructed data are first obtained from the initial model trained with all training data. Then, support vectors are selected from the reconstructed data set according to the location information of density clustering in the training data set, and the selection process terminates after the whole training data set has been traversed. Finally, the training model is built on the optimal reconstructed support vectors, and the hyperparameters are tuned subsequently. In addition, the paper puts forward a supplemental algorithm that removes the redundant support vectors of the previous model. Experiments on synthetic data sets, benchmark data sets, and mill load data sets are carried out, and the results illustrate the effectiveness of the proposed sparse method for LSSVM.


Introduction
The ball mill pulverizing system is widely applied in large- and medium-scale power plants in China, most of which use tubular ball mills to pulverize coal. Many design parameters and unmeasured operating parameters affect the pulverizing circuits. Mill load is an important unmeasurable parameter that is closely related to the energy efficiency of the pulverizing system. Much research on measuring the mill load has been presented in the past decades [1,2]. The soft sensing technique is a new and low-cost method that estimates the mill load by building models on measurable information. The most commonly used measurable parameters are noise and vibration data [3,4]. How to build the forecasting model, which plays a central role in the soft sensing technique, remains a significant problem. So far, researchers have come up with many methods to build the soft sensing model, such as neural networks [5], support vector machines (SVM) [6], partial least squares [7], and the least squares support vector machine (LSSVM) [8]; each of these methods is aimed at a specific problem. In this paper, we mainly study the mathematical problems of LSSVM for building the soft sensing model. LSSVM, as proposed by Suykens, was introduced to reduce the computational complexity of SVM. In LSSVM, the inequality constraints of the quadratic programming problem are replaced with equality constraints, so training reduces to solving a set of linear equations. Thus it trains faster than SVM. However, LSSVM has two main drawbacks: its solution lacks sparseness and robustness [9]. These problems increase the training time and reduce the prediction accuracy of the model on real industrial data sets, which have troublesome characteristics such as imbalanced distribution, heteroscedasticity, and the explosion of data. Therefore, this paper focuses on the mathematical problem of how to improve LSSVM sparseness and robustness for real industrial data sets and applies the result to mill load prediction.
Many efforts have been made to mitigate these shortcomings. For example, Suykens et al. introduced a pruning algorithm based on the sorted support value spectrum and proposed a sparse least squares support vector classifier, which gradually removes the training samples with the smallest absolute support values and retrains the reduced network. Later, this method was extended to least squares support vector regression [10]. Meanwhile, weighted LSSVM was presented to improve the robustness of the LSSVM solution [9]. From that point, sparse algorithms based on pruning strategies appeared in quick succession. Kruif and Vries presented a more sophisticated mechanism for selecting support vectors [11], in which the training sample that introduces the smallest approximation error when omitted is pruned. Hoegaerts et al. [12] provided a comparison among LSSVM pruning algorithms and concluded that pruning schemes can be divided into QR decomposition and searching feature vector. Instead of determining the pruning points by errors, Zeng and Chen [13] introduced the sequential minimal optimization method to omit the datum that leads to the minimum change in a dual objective function. Based on kernel partial least squares identification, Song and Gui [14] presented a method to obtain base vectors by reducing the kernel matrix through Schmidt orthogonalization.
Generally speaking, the algorithms mentioned above all follow a backward pruning strategy. Correspondingly, methods that iteratively select support vectors in a forward manner have recently been used to achieve sparseness. Yu et al. [15] provided a sparse LSSVM based on active learning, which greedily selects the datum with the largest absolute approximation error. Jiao et al. [16] introduced a fast sparse approximation scheme for LSSVM, which picks the sample that contributes most to the objective function. Later, an improved method [17] based on a partial reduction strategy was extended to LSSVR. Subsequently, a recursive reduced least square support vector machine (RRLSSVM) and an improved RRLSSVM algorithm (IRRLSSVM) were proposed [18,19]. Both choose the support vector that leads to the largest reduction in the objective function. The difference between them is that IRRLSSVM updates the weights of the selected support vectors during the selection process, whereas RRLSSVM does not. Additionally, RRLSSVM has been applied to online sparse LSSVM [20].
Backward algorithms have higher computational complexity, since the full-order matrix is gradually decomposed into the submatrix that leads to the minimal increment of the objective function [21]. Forward algorithms, in contrast, have low computational complexity and require little memory, but their convergence has not been proved. In addition, there are methods that sparsify LSSVM from other aspects, for example, methods based on genetic algorithms [22,23] and compressive sampling theory [24].
For the aforementioned sparse algorithms, almost all of which employ the radial basis function (RBF) kernel, the hyperparameters optimized on the original data set remain unchanged during greedy learning. In other words, the process of greedy learning can be regarded as the parameters selecting the support vectors, because the RBF kernel has a local characteristic that is measured by the Euclidean distance between data points. Therefore, there is another perspective from which to think about sparseness. No matter what kind of algorithm is used, the initial model with all training data is obtained in advance. What we finally want is to reconstruct the model with the fewest support vectors while keeping nearly the same approximation accuracy as the initial model. Hence, we attempt to realize the sparseness of LSSVM by reconstructing the support vectors to recover the initial model, with the reconstruction strategy corresponding to the parameters of the RBF kernel. The resulting reconstructed least square support vector machine for regression problems is abbreviated RCLSSVR. Moreover, the most noticeable innovation is to analyze the features of industrial data sets and introduce RCLSSVR to improve sparseness and robustness simultaneously. Industrial data sets have particular features, such as imbalanced data distribution and heteroscedasticity. These features lead to large differences in the sparsification process, because removing or adding a datum at different positions has different effects, which causes iterative algorithms to choose more support vectors in order to gain robustness. We therefore reconstruct the support vectors according to the initial model and the location of the original data, so that the problems of robustness and sparseness are addressed simultaneously.
This paper is organized as follows. In Section 2, the preliminary knowledge is briefly introduced, including the fundamentals of LSSVM and the principle of reduced LSSVM. The characteristics of real industrial data and our proposed algorithm are developed in Section 3. Simulations on function approximation problems and benchmark data sets are presented in Section 4. Finally, the paper is concluded in Section 5. To facilitate reading, the necessary abbreviations are listed in the table of Abbreviations.

Normal LSSVM and the Reduced LSSVM
Given a training data set $\{x_i, y_i\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^{d}$ is the $d$-dimensional input and $y_i \in \mathbb{R}$ is its corresponding target, the goal of function approximation is to find the underlying relation between the input and the target value. Once this relation is found, the outputs corresponding to inputs that are not contained in the training set can be approximated.
In the LSSVM, the relation underlying the data set is represented as a function of the following form:

$$f(x) = w^{T}\varphi(x) + b, \tag{1}$$

where $\varphi$ is a mapping of the vector $x$ to a high dimensional feature space, $b$ is the bias, and $w$ is a weight vector of the same dimension as the feature space. The mapping $\varphi(x)$ is commonly nonlinear and makes it possible to approximate nonlinear functions. Mappings that are often used result in an approximation by a radial basis function, by polynomial functions, or by linear functions [25].
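The approximation error for sample $i$ is defined as follows:

$$e_i = y_i - w^{T}\varphi(x_i) - b. \tag{2}$$

The optimization problem in LSSVM is to search for those weights that give the smallest summed quadratic error of the training samples. The minimization of the error together with the regularization is given as

$$\min_{w,b,e} J(w,e) = \frac{1}{2}w^{T}w + \frac{\gamma}{2}\sum_{i=1}^{N} e_i^{2} \tag{3}$$

with the equality constraint

$$y_i = w^{T}\varphi(x_i) + b + e_i, \quad i = 1,\ldots,N. \tag{4}$$

Here $\gamma$ is the regularization parameter. This constrained optimization problem is solved by introducing the Lagrangian function

$$L(w,b,e;\alpha) = J(w,e) - \sum_{i=1}^{N}\alpha_i\left(w^{T}\varphi(x_i) + b + e_i - y_i\right), \tag{5}$$

where $\alpha_i \in \mathbb{R}$ is the Lagrange multiplier of $x_i$. The optimum can be found by setting the derivatives equal to zero:

$$\frac{\partial L}{\partial w}=0 \Rightarrow w=\sum_{i=1}^{N}\alpha_i\varphi(x_i),\quad \frac{\partial L}{\partial b}=0 \Rightarrow \sum_{i=1}^{N}\alpha_i=0,\quad \frac{\partial L}{\partial e_i}=0 \Rightarrow \alpha_i=\gamma e_i,\quad \frac{\partial L}{\partial \alpha_i}=0 \Rightarrow w^{T}\varphi(x_i)+b+e_i-y_i=0. \tag{6}$$

Eliminating the variables $w$ and $e_i$, we get the following linear equations:

$$\begin{bmatrix} 0 & \mathbf{1}^{T} \\ \mathbf{1} & \Omega + \gamma^{-1}I \end{bmatrix}\begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix}, \tag{7}$$

where $y = [y_1,\ldots,y_N]^{T}$, $\mathbf{1} = [1,\ldots,1]^{T}$, $\alpha = [\alpha_1,\ldots,\alpha_N]^{T}$, $\Omega$ is a square matrix with $\Omega_{ij} = \varphi(x_i)^{T}\varphi(x_j) = K(x_i, x_j)$, $i,j = 1,\ldots,N$, and $K$ is the kernel function, which satisfies Mercer's condition. The kernel is used to evaluate the inner product in the input space instead of the feature space. In LSSVM, we can use linear, polynomial, and RBF kernels and others that satisfy Mercer's condition. In this paper we focus on the RBF kernel:

$$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^{2}}{2\sigma^{2}}\right). \tag{8}$$

The solution of this set of equations results in a vector of Lagrange multipliers $\alpha$ and a bias $b$. The output of the approximation can then be calculated for new input values $x$ with $\alpha$ and $b$. The predicted value is derived as follows:

$$\hat{y}(x) = \sum_{i=1}^{N}\alpha_i K(x, x_i) + b. \tag{9}$$

For the sparse LSSVR, the pruning or reduction strategy is applied to the solution, which lets $w = \sum_{i \in S}\alpha_i\varphi(x_i)$, where $S$ is an index subset of $\{1,\ldots,N\}$. Substituting $w$ into (3) gives

$$\min_{\alpha_S, b} L(\alpha_S, b) = \frac{1}{2}\alpha_S^{T}K_S\alpha_S + \frac{\gamma}{2}\sum_{i=1}^{N}\Big(y_i - \sum_{j \in S}\alpha_j K(x_j, x_i) - b\Big)^{2}, \tag{10}$$

where $K_S$ has entries $K(x_i, x_j)$, $i,j \in S$, and $\alpha_S$ is the subset of $\alpha$ with the index subset $S$. Reformulating (10), and scaling by the positive constant $2/\gamma$, which does not change the minimizer, we obtain

$$\min_{b,\alpha_S} L(b, \alpha_S) = \begin{bmatrix} b & \alpha_S^{T} \end{bmatrix}\left(\begin{bmatrix} 0 & \mathbf{0}^{T} \\ \mathbf{0} & K_S/\gamma \end{bmatrix} + \begin{bmatrix} \mathbf{1}^{T} \\ \hat{K} \end{bmatrix}\begin{bmatrix} \mathbf{1} & \hat{K}^{T} \end{bmatrix}\right)\begin{bmatrix} b \\ \alpha_S \end{bmatrix} - 2\begin{bmatrix} b & \alpha_S^{T} \end{bmatrix}\begin{bmatrix} \mathbf{1}^{T} \\ \hat{K} \end{bmatrix}y + y^{T}y, \tag{11}$$

where $\hat{K}$ has entries $K(x_i, x_j)$, $i \in S$, $j \in \{1,\ldots,N\}$, and $\mathbf{0}$ and $\mathbf{1}$ are vectors of appropriate dimension. The solution is obtained by setting the derivatives of (11) with respect to $b$ and $\alpha_S$ to zero:

$$\left(\begin{bmatrix} 0 & \mathbf{0}^{T} \\ \mathbf{0} & K_S/\gamma \end{bmatrix} + \begin{bmatrix} \mathbf{1}^{T} \\ \hat{K} \end{bmatrix}\begin{bmatrix} \mathbf{1} & \hat{K}^{T} \end{bmatrix}\right)\begin{bmatrix} b \\ \alpha_S \end{bmatrix} = \begin{bmatrix} \mathbf{1}^{T} \\ \hat{K} \end{bmatrix}y. \tag{12}$$

So the reduced LSSVR is found:

$$\hat{y}(x) = \sum_{i \in S}\alpha_i K(x_i, x) + b. \tag{13}$$

Comparing (9) and (13), we can see that only a subset of the training samples contributes to the model rather than every training sample. Therefore, the computational complexity and the operation time of prediction are decreased.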
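As a minimal sketch of the two solves discussed in this section, the following NumPy code builds the normal LSSVM system (7) and the reduced system for a given index subset S, then predicts with the resulting multipliers and bias. The function names are hypothetical, and the kernel scaling exp(-||x - x'||^2 / (2 sigma^2)) is an assumption; some toolboxes use a different convention for the width parameter.

```python
import numpy as np


def rbf_kernel(A, B, sigma):
    """K[i, j] = exp(-||A[i] - B[j]||^2 / (2 sigma^2)); the scaling is an assumption."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))


def lssvm_fit(X, y, gamma, sigma):
    """Solve the (N+1) x (N+1) linear system of the normal LSSVM for (b, alpha)."""
    n = X.shape[0]
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, sigma) + np.eye(n) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]


def reduced_lssvm_fit(X, y, S, gamma, sigma):
    """Solve the reduced system for (b, alpha_S); S indexes the support vectors."""
    K_S = rbf_kernel(X[S], X[S], sigma)               # |S| x |S| block
    K_hat = rbf_kernel(X[S], X, sigma)                # |S| x N block
    Z = np.vstack([np.ones((1, X.shape[0])), K_hat])  # stacked [1^T; K_hat]
    R = np.zeros((len(S) + 1, len(S) + 1))
    R[1:, 1:] = K_S / gamma                           # regularization block
    sol = np.linalg.solve(R + Z @ Z.T, Z @ y)
    return sol[0], sol[1:]


def lssvm_predict(X_new, X_sv, alpha, b, sigma):
    """Evaluate sum_i alpha_i K(x, x_i) + b at the rows of X_new."""
    return rbf_kernel(X_new, X_sv, sigma) @ alpha + b
```

With `S = np.arange(0, N, 4)`, for instance, prediction evaluates only a quarter of the kernel terms, which is the saving that the comparison of (9) and (13) refers to.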

Reconstruct Least Square Support Vector Regression
3.1. The Characteristic of Industrial Data Set. In real industrial processes, various causes result in imbalanced data distribution and heteroscedasticity, so both features are briefly introduced here. The first feature is the imbalanced distribution, which commonly exists in classification domains such as spotting unreliable telecommunication customers, text classification, and detection of fraudulent telephone calls [26,27]. Solutions at the data and algorithmic levels have been proposed for the class-imbalance problem [28-30]. Imbalance is also a problem in sparse approximation: when a support vector is added to or deleted from different areas, the change in estimation performance differs, and the samples in sparsely populated areas are easily cut off in the sparsification process. The second feature is heteroscedasticity. Heteroscedasticity, that is, non-constant error variance, brings about severe consequences: the variance of the parameter estimates is no longer minimal, and the prediction precision decreases [31]. The solutions fall into two categories: the data transformation method and the modeling of heteroscedastic data [32,33]. In function estimation, the prediction precision is more sensitive to the samples in regions with high variance, and the number of support vectors in high-variance regions is usually larger. Finally, the explosion of data, especially of imbalanced and heteroscedastic data, has been plaguing machine learning.
Both forward and backward algorithms have certain shortcomings on industrial data sets. Firstly, for imbalanced data sets, since the objective function balances the upper bound of the maximum points to the hyperplane against the minimum mean square error, the rare samples contribute little to the objective function. Hence, the data in regions with more samples are more easily selected than the data in regions with few samples. Secondly, for heteroscedastic data sets, relatively more data are selected or retained in areas of large variance than in areas of small variance. Finally, for excessively large data sets, the main problem is the sheer size: the training time is unacceptable for backward greedy methods, and the convergence problem exists in forward greedy algorithms.

RCLSSVR and DRCLSSVR.
In order to solve the aforementioned problems, we propose an efficient method that reconstructs support vectors to restore the original model. From (8), we know that the common kernel function has a local characteristic, equivalent to normalizing the Euclidean distance between data points by the kernel parameter $\sigma$. Therefore, it is possible to restore the original model by evenly choosing support vectors near the hyperplane and adjusting the parameter $\sigma$. However, convergence is not always stable because the data distribution is unknown; this can be solved by selecting the support vectors directly on the original model instead of from the training data set. Thus, according to the position information of the training data set and the original model, we can rebuild the data set as the reconstructed data set $\{x_i, \hat{y}_i\}_{i=1}^{N}$, where $\hat{y}_i$ is the estimated value calculated by (9). The reconstructed samples in this data set can then be selected as the support vectors.
The sparse strategy is different from the forward and backward greedy methods, and the process has two steps. The first step is to select uniformly distributed samples by passing through the entire original data set. At the beginning, one point $\{x_c, y_c\}$ is picked arbitrarily from the original data set as the density center, and the Euclidean distance vector Dist is calculated by (14). All samples M within the neighborhood radius of the density center are then found, and the density center is updated based on (15).

In these expressions, the index set of M is a subset of $\{1, \ldots, N\}$, and $|\hat{y}_i - y_i|$ represents the absolute error between the predicted value and the true one. The ratio parameter expresses the neighborhood radius as a fraction of the maximum distance and decides how many support samples are selected. Dist_max is the maximum distance between two samples of the original data set. This means that the datum far away from the density center and near the hyperplane is selected as the new density center. The selection process terminates when the original data set has been traversed. The second step is to select, in each iteration, the support vector $\{x_c, \hat{y}_c\}$ from the reconstructed data set that corresponds to the density center. Based on the selected support vectors, the regularization parameter and kernel parameter are optimized again by the leave-one-out method. The procedure of RCLSSVR is given in Algorithm 1.
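The first step can be illustrated with a rough sketch. The text only states that the new density center should lie far from the current center and near the fitted curve, so the scoring rule below (distance divided by absolute error), the fallback jump to the nearest uncovered sample, and all function and parameter names are assumptions made for illustration, not the paper's exact update rule.

```python
import numpy as np


def select_density_centers(X, abs_err, ratio, start=0, eps=1e-12):
    """Traverse X with density-center neighborhoods and return the selected indices.

    abs_err[i] is |y_hat_i - y_i| from the initial model. The scoring rule used
    to move the center and the fallback jump are assumptions of this sketch.
    """
    n = X.shape[0]
    pairwise = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    radius = ratio * pairwise.max()            # radius as a fraction of Dist_max
    visited = np.zeros(n, dtype=bool)
    selected = []
    c = start
    while True:
        dist = pairwise[c]                     # distance vector from the center
        in_radius = np.flatnonzero(dist <= radius)
        newly = in_radius[~visited[in_radius]]
        visited[in_radius] = True
        selected.append(c)
        if visited.all():                      # the whole data set is traversed
            break
        cand = newly[newly != c]
        if cand.size > 0:
            # prefer samples far from the center and close to the fitted curve
            score = dist[cand] / (abs_err[cand] + eps)
            c = int(cand[np.argmax(score)])
        else:
            # assumed fallback: jump to the nearest sample not yet covered
            unvisited = np.flatnonzero(~visited)
            c = int(unvisited[np.argmin(dist[unvisited])])
    return np.array(selected), radius
```

The returned indices identify which reconstructed samples would be taken as support vectors in the second step; every training sample ends up within the radius of at least one selected center.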
Algorithm 1 (RCLSSVR).
Output: the support vectors S.
(1) Train the original model by solving (7) with the training data set, where the hyperparameters $(\gamma, \sigma)$ are found by 10-fold cross validation;
(2) obtain the function estimates $\{\hat{y}_i\}_{i=1}^{N}$ by (9), and get the reconstructed data set $\{x_i, \hat{y}_i\}_{i=1}^{N}$;
(3) calculate the Euclidean distance matrix, its maximum Dist_max, and the neighborhood radius;
(4) randomly select a density center $\{x_c, y_c\}$, and add $\{x_c, \hat{y}_c\}$ from the reconstructed data set to S;

Generally speaking, the rate of convergence and the performance of RCLSSVR are bound up with the ratio parameter. A smaller ratio avoids the underfitting problem but also multiplies the number of support vectors, whereas a larger ratio reduces the prediction accuracy. Two situations are worth considering after RCLSSVR ends: one is that the prediction precision does not meet the demands; the other is that the data set S can still be pruned. Therefore, remedial measures for RCLSSVR are taken in two ways. The first is to improve the prediction precision by gradually adding samples from the reconstructed data set to the support vectors, where the sample that yields the smallest approximation error when added is selected. The second is to prune the redundant samples while still meeting the precision requirement. We define the density indicator Den to decide which sample is omitted.
In Den, the index ranges over the support vectors in S, and the ratio used in Den takes the same value as the ratio parameter of RCLSSVR. In the pruning process, the sample with the largest value of Den is pruned. The remedial method is named DRCLSSVR and is described in Algorithm 2.
Algorithm 2 (DRCLSSVR).
(1) Compare the performance with the set value; if it is lower, go to (5); otherwise go to (2);
(2) calculate the density indicator Den of all support vectors and sort it; remove the sample with the largest value of Den;
(3) retrain the LSSVM based on the reduced support vectors;
(4) go to (2), unless the performance falls below the set value;
(5) examine the training samples from the reconstructed data set and add the sample that increases the performance the most;
(6) retrain the LSSVM based on the augmented support vectors;
(7) go to (5), unless the performance has reached the set value.
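Step (2) of Algorithm 2 can be sketched as follows. The exact definition of Den is not reproduced here, so this example assumes a simple neighbor count: for each support vector, the number of other support vectors within a given radius. The function names and the neighbor-count definition are hypothetical.

```python
import numpy as np


def density_indicator(X_sv, radius):
    """Assumed Den: number of other support vectors within `radius` of each one."""
    d = np.linalg.norm(X_sv[:, None, :] - X_sv[None, :, :], axis=2)
    return (d <= radius).sum(axis=1) - 1      # subtract 1 to exclude the point itself


def prune_one(X_sv, radius):
    """Return the indices kept after removing the support vector with the largest Den."""
    den = density_indicator(X_sv, radius)
    drop = int(np.argmax(den))
    keep = np.delete(np.arange(X_sv.shape[0]), drop)
    return keep, drop
```

In a full DRCLSSVR loop, each call to `prune_one` would be followed by retraining on the kept support vectors and a check of the performance against the set value, as in steps (3) and (4).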

Experimental Results
In order to verify the performance of the proposed RCLSSVR and DRCLSSVR, several kinds of experiments are performed. In Section 4.1, the influence of the ratio parameter on sparse performance is studied. In Section 4.2, we select two backward algorithms and one forward algorithm to perform comparative experiments on synthetic data sets and benchmark data sets. In Section 4.3, RCLSSVR is applied to the mill load data set. For comparison purposes, all experiments are run on a platform with an Intel Core i5-4460 CPU @ 3.20 GHz and 4.00 GB RAM under Windows 7, in a MATLAB 2014a environment with the LS-SVMlab v1.8 toolbox from http://www.esat.kuleuven.be/sista/lssvmlab/. The comparison algorithms are normal LSSVM, Suykens' pruning algorithm (SLSSVM) [14], the classic backward algorithm (PLSSVM) [16], and IRRLSSVM [23]. Among them, PLSSVM is expected to perform best, but it is an extremely expensive algorithm. The RBF kernel is used in all of the experiments, and the parameters $(\gamma, \sigma)$ are optimized by the leave-one-out cross validation strategy [22]. In addition, two performance indexes, namely, the root mean squared error (RMSE) and RMSE%, are defined to evaluate these algorithms.
where normal_RMSE is the RMSE of normal LSSVM.
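The two indexes can be sketched in a few lines. RMSE follows its standard definition. For RMSE%, the form used below (normal_RMSE divided by the model's RMSE, times 100) is an assumption chosen so that the value rises toward 100% as a sparse model approaches the normal LSSVM; the function names are hypothetical.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def rmse_percent(model_rmse, normal_rmse):
    """Assumed RMSE%: RMSE of the normal LSSVM relative to the sparse model, in percent."""
    return 100.0 * normal_rmse / model_rmse
```

Under this assumed definition, a sparse model whose RMSE is 0.20 when the normal LSSVM reaches 0.18 would be rated at about 90%.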

Experiment 1: The Performance with Different Ratio Parameters. In this subsection, we use the sinc function to investigate the performance of our proposed algorithms with different values of the ratio parameter.
Since DRCLSSVR is a supplement to RCLSSVR, the experiment only explores the performance of RCLSSVR with different parameter values. It examines the influence of the parameter in situations where more than one pass is made through the data set. In other words, whenever a pass is completed, the traversal restarts from the full training set $\{x_i, y_i\}_{i=1}^{N}$ until a certain number of support vectors is reached. The relation between the inputs $x$ and outputs $y$ of the sinc function is described as follows:

$$y = \frac{\sin(x)}{x} + e, \tag{20}$$

where $x \in [-10, 10]$ is sampled at equal intervals, giving 300 data points in total, and $e$ is Gaussian noise with mean 0 and variance 0.5. For this data set, we randomly select 200 samples as the training data set and the others as the testing data set.
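The data set described above could be generated along the following lines. This is only a sketch: the function name, the random seed, and the use of NumPy's generator are assumptions, while the sample counts, input range, and noise variance are taken from the text.

```python
import numpy as np


def make_sinc_data(n_total=300, n_train=200, noise_var=0.5, seed=0):
    """Noisy sinc samples on [-10, 10] with a random train/test split."""
    rng = np.random.default_rng(seed)
    x = np.linspace(-10.0, 10.0, n_total)          # equally spaced inputs
    # np.sinc(t) is sin(pi t) / (pi t), so np.sinc(x / pi) equals sin(x) / x
    y = np.sinc(x / np.pi) + rng.normal(0.0, np.sqrt(noise_var), n_total)
    perm = rng.permutation(n_total)
    train, test = perm[:n_train], perm[n_train:]
    return x[train], y[train], x[test], y[test]
```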
In order to improve the generalization ability, the attributes of the training data sets are normalized into the closed interval [0, 1] before the Euclidean distance vector Dist is calculated. The ratio parameter, which governs the number of support vectors as a proportion of the training data set, ranges from 0.1 to 0.3 in steps of 0.05. The simulation results of RCLSSVR are plotted in Figure 1.
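A minimal sketch of this preprocessing is given below, assuming min-max scaling with statistics taken from the training set; the helper names are hypothetical, and the handling of constant attributes is a choice made for this example.

```python
import numpy as np


def minmax_fit(X):
    """Per-attribute minimum and span of the training data."""
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span = np.where(span == 0, 1.0, span)      # avoid dividing by zero for constant attributes
    return lo, span


def minmax_apply(X, lo, span):
    """Map attributes into [0, 1] using the training statistics."""
    return (X - lo) / span


# Grid of ratio values from 0.1 to 0.3 in steps of 0.05
ratios = np.linspace(0.1, 0.3, 5)
```

Reusing the training minimum and span on test data keeps distances comparable between the two sets, although test values may then fall slightly outside [0, 1].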
As can be seen from Figure 1, a larger ratio parameter leads to faster convergence. When the parameter is larger than 0.25, the performance fluctuates because the algorithm updates the density center by repeatedly passing through the data set. When the parameter is smaller than 0.2, convergence is too slow because more support vectors (#SV) are needed to make one pass through the training data set. The most attractive characteristic is that the performance reaches that of the normal LSSVR once the number of support vectors reaches a certain level. Considering rapid convergence and good stability, the parameter is set to 0.2 for all training data sets in this paper.
In order to confirm the effectiveness of DRCLSSVR and to compare the #SV selected by RCLSSVR and DRCLSSVR, the sparse results are compared in Figure 2. It is clear that DRCLSSVR needs fewer #SV to reach almost the same generalization performance, which also indicates that some support vectors of RCLSSVR are redundant.

Experiment 2: The Simulation on Synthetic Data Sets and Benchmark Data Sets. In this subsection, we compare the performance of the different sparse algorithms mentioned above to show the robustness and sparseness of RCLSSVR and DRCLSSVR. The input and output variables of each data set are normalized into the closed interval [0, 1]. We first generate some synthetic data sets according to the properties of real data sets in order to study the heteroscedastic and imbalanced problems. The heteroscedastic data sets are generated by the sinc function with two kinds of noise: multiplicative noise and Gaussian noise with heterogeneous variance. The imbalanced data set is also generated by (20); the difference is that the numbers of samples in different intervals are not the same. In addition, the testing data set is generated directly by the sinc function. These data sets are illustrated in Figure 3.
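The three synthetic training sets and the noise-free test set can be sketched as below, using the noise variances and interval proportions given in the caption of Figure 3. Several details are assumptions of this sketch: the function names, equally spaced inputs for the first two sets, uniform sampling inside each interval for the imbalanced set, and additive noise of variance 0.5 for the imbalanced set.

```python
import numpy as np


def sinc(x):
    """sin(x) / x, with the value 1 at x = 0."""
    return np.sinc(x / np.pi)


def make_synthetic(n=100, seed=0):
    rng = np.random.default_rng(seed)
    x = np.linspace(-10.0, 10.0, n)

    # (a) multiplicative noise with zero mean and variance 0.1
    y_mult = (1.0 + rng.normal(0.0, np.sqrt(0.1), n)) * sinc(x)

    # (b) additive Gaussian noise: variance 0.4 for x < 0 and 1.2 for x >= 0
    std = np.where(x < 0, np.sqrt(0.4), np.sqrt(1.2))
    y_het = sinc(x) + rng.normal(0.0, 1.0, n) * std

    # (c) imbalanced inputs: about 90% in [-5, 5] and 5% in each outer interval
    n_side = int(round(0.05 * n))
    x_imb = np.sort(np.concatenate([
        rng.uniform(-10.0, -5.0, n_side),
        rng.uniform(-5.0, 5.0, n - 2 * n_side),
        rng.uniform(5.0, 10.0, n_side),
    ]))
    y_imb = sinc(x_imb) + rng.normal(0.0, np.sqrt(0.5), n)   # assumed noise level

    # (d) test set taken directly from the sinc function
    x_test = np.linspace(-10.0, 10.0, n)
    return {
        "mult": (x, y_mult),
        "het": (x, y_het),
        "imb": (x_imb, y_imb),
        "test": (x_test, sinc(x_test)),
    }
```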
When the test data set comes from the original sinc function, in contrast to the noisy training data set, the accuracy of the prediction model is more sensitive. We gradually increase the number of support vectors to investigate the robustness of RCLSSVR. The results are shown in Figure 4 and Table 1, which illustrate that the generalization results of RCLSSVR are more robust than those of the contrast algorithms. IRRLSSVM and SLSSVM suffer on the heteroscedastic data sets and do well on the imbalanced data set. PLSSVM and RCLSSVR obtain very good results on all of the abovementioned data sets. However, PLSSVM is highly time-consuming, especially for large data sets; in this paper it is used only for comparing sparse performance, not as a candidate algorithm. RCLSSVR has good sparse properties as well as suitable time complexity. Most importantly, RCLSSVR has the best performance and robustness after passing through the training data sets, and this performance is little affected when the number of support vectors increases.
Next, we conduct this experiment on some benchmark regression data sets to investigate the sparseness of RCLSSVR and DRCLSSVR. The data sets Chwirut1 and Nelson are downloaded from http://www.itl.nist.gov/div898/strd/nls/nls_main.shtml; Housing, Motorcycle, Airfoil_self_noise, and Yacht_hydrodynamics are downloaded from http://archive.ics.uci.edu/ml/datasets.html; and MPG, Pyrim, and Bodyfat are downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. These data sets also exhibit heteroscedasticity, imbalance, and high dimension, and detailed information about them is listed in Table 2. The comparison results for these benchmark data sets are tabulated in Table 3, where the #SV column gives the number of training samples. Because the performance approaches that of the normal LSSVM as #SV increases, we compare the number of support vectors at which the performance reaches 90% of the normal LSSVM. The forward algorithms are stopped when the performance exceeds 90%. For the backward algorithms, training is terminated once the performance falls below 90%. Besides, the RMSE is the performance on the testing data set; the parameters $\gamma_1$ and $\sigma_1$ are optimized on the original training data set, and the parameters $\gamma_2$ and $\sigma_2$ are the final optimization results for RCLSSVR. From Table 3, we can conclude that the proposed algorithms have better sparseness and robustness than the other compared methods. As expected, PLSSVM achieves better RMSE than SLSSVM, but its training time is too expensive. SLSSVM is relatively unfavorable and performs worse in high dimensionality. IRRLSSVM

Experiment 3: The Simulation on Mill Load Data Sets. Mill load, which is critical to increasing the efficiency and energy efficiency of the pulverizing system, has not been accurately measured because it is influenced by numerous factors. Paper [34] proposed that the mill load can be represented by mill noise, mill vibration, mill current, import-export pressure difference, and export temperature, so the estimation of mill load is formulated as a function of these five measured variables. The mill load data set, consisting of 2400 samples, covers three conditions: low load, normal load, and high load. For the training and testing sets, 600 samples and 200 samples are randomly selected, respectively, and the model parameters are optimized by 10-fold cross validation. The comparison results between these sparse algorithms, shown in Figure 5, are consistent with the experimental results on benchmark data sets. RCLSSVR needs fewer #SV to reach the normal performance, with a shorter testing time in the prediction phase. The estimation results of RCLSSVR and DRCLSSVR are illustrated in Figure 6. As can be seen from Figure 6, the testing data set contains different wear conditions, and the model built by RCLSSVR has good estimation performance in all conditions. It is evident that RCLSSVR considers both sparseness and robustness in the training process and is well suited to real industrial data sets. The experimental details are tabulated in Table 4.
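The 10-fold cross validation used to tune the model parameters might look like the following sketch, which scores each candidate pair of regularization and kernel parameters by its pooled validation RMSE. The function names, the candidate grids, and the kernel scaling are assumptions; the experiments in the paper use the LS-SVMlab toolbox, not this code, and any input matrix with five columns merely stands in for the five measured variables.

```python
import numpy as np


def _rbf(A, B, sigma):
    # Gaussian kernel; the 2 * sigma^2 scaling is an assumption of this sketch
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma**2))


def _fit_predict(X_tr, y_tr, X_te, gamma, sigma):
    """Train a normal LSSVM on (X_tr, y_tr) and predict at X_te."""
    n = X_tr.shape[0]
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = _rbf(X_tr, X_tr, sigma) + np.eye(n) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], y_tr)))
    return _rbf(X_te, X_tr, sigma) @ sol[1:] + sol[0]


def cv_rmse(X, y, gamma, sigma, k=10, seed=0):
    """Pooled RMSE over k validation folds."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), k)
    sq_err = 0.0
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        pred = _fit_predict(X[train], y[train], X[test], gamma, sigma)
        sq_err += np.sum((pred - y[test]) ** 2)
    return float(np.sqrt(sq_err / len(y)))


def grid_search(X, y, gammas, sigmas, k=10):
    """Return the (gamma, sigma) pair with the lowest cross-validated RMSE."""
    scores = {(g, s): cv_rmse(X, y, g, s, k) for g in gammas for s in sigmas}
    best = min(scores, key=scores.get)
    return best, scores[best]
```

An exhaustive grid is the simplest choice for two parameters; each candidate costs k linear solves of size roughly N(k-1)/k, which is manageable for a training set of 600 samples.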
hyperparameter by optimizing the support vector in the training process. DRCLSSVR is proposed for removing the redundant support vectors of RCLSSVR. In order to demonstrate the effectiveness of RCLSSVR and DRCLSSVR, several experiments on function approximation are carried out using the sinc data set and benchmark data sets. The experimental results demonstrate that the proposed methods are superior to the compared algorithms, not only in RMSE precision but also in the number of support vectors. Finally, RCLSSVR and DRCLSSVR are applied to mill load prediction and achieve good sparseness and robustness.

Figure 3: Three synthetic data sets: (a) synthetic data set one with multiplicative noise $y = (1 + \nu) \times \operatorname{sinc}(x)$, where $\nu$ is random noise with zero mean and variance 0.1; (b) synthetic data set two with Gaussian noise of heterogeneous variance, where the noise has zero mean and variance 0.4 on [−10, 0] and 1.2 on [0, 10]; (c) synthetic data set three, an imbalanced data set with about 90% of the data on the interval [−5, 5] and 5% on each of [−10, −5] and [5, 10]; (d) the testing data set. The number of training samples and of testing samples is 100.

Figure 4: Generalization performance of RCLSSVR with different data sets: (a) synthetic data set one; (b) synthetic data set two; (c) synthetic data set three.

Figure 6: The estimation output of testing data set.

Table 1: Comparison of the performance among these algorithms; NLSSVM with 100 data points, the others with 15 #SV.

Table 2: Detailed information of benchmark data sets.

Table 3: Experimental results on benchmark data sets.

Table 4: Experimental results on mill load data set.

the scattered and high dimension data set and is more susceptible to the parameter effects. But the testing time is less than that of other algorithms with the same #SV, as it does not need to build the model. Clearly, the fewer the #SV, the shorter the testing time, and different algorithms have different training times. In practical applications, the training time and testing time should be kept within the allowable range, because the sparse algorithms are trained offline and used online.