Sparse strategies play a significant role in applying the least squares support vector machine (LSSVM), because the LSSVM solution lacks both sparseness and robustness. In this paper, a sparse method using reconstructed support vectors is proposed and successfully applied to mill load prediction. Unlike other sparse algorithms, it no longer selects support vectors from the training data set according to their ranked contributions to the LSSVM optimization. Instead, reconstructed data are first obtained from the initial model trained on all training data. Support vectors are then selected from the reconstructed data set according to the location information given by density clustering on the training data set, and selection terminates once the whole training data set has been traversed. Finally, the model is built on the optimal reconstructed support vectors, and the hyperparameters are tuned afterwards. Moreover, a supplemental algorithm is put forward to remove redundant support vectors from the previous model. Extensive experiments on synthetic data sets, benchmark data sets, and mill load data sets illustrate the effectiveness of the proposed sparse method for LSSVM.
1. Introduction
The ball mill pulverizing system is widely used in large- and medium-scale power plants in China, almost all of which employ tubular ball mills to pulverize coal. Many design parameters and unmeasured operating parameters affect the pulverizing circuits. Mill load is an important unmeasurable parameter that is closely related to the energy efficiency of the pulverizing system. Much research on measuring the mill load has been presented over the past decades [1, 2]. Soft sensing is a new, low-cost technique that selects measurable information to estimate the mill load through model building. The most common measurable parameters are noise and vibration data [3, 4]. How to build the forecasting model, which plays a central role in the soft sensing technique, remains a significant problem. So far, researchers have proposed many methods for building the soft sensing model, such as neural networks [5], support vector machines (SVM) [6], partial least squares [7], and the least squares support vector machine (LSSVM) [8]; each of these methods is aimed at a specific problem. In this paper, we mainly study the mathematical problems of LSSVM for building the soft sensing model.
LSSVM, proposed by Suykens, was introduced to reduce the computational complexity of SVM. In LSSVM, the inequality constraints of the quadratic program are replaced with equality constraints, so training is faster than for SVM. However, LSSVM has two main drawbacks: its solution lacks sparseness and robustness [9]. These problems increase the training time and reduce the prediction accuracy on real industrial data sets, which have troublesome characteristics such as imbalanced distribution, heteroscedasticity, and data explosion. Therefore, this paper focuses on the mathematical problem of improving LSSVM sparseness and robustness for real industrial data sets and applies the result to mill load prediction.
Many efforts have been made to mitigate these shortcomings. Suykens et al. introduced a pruning algorithm based on the sorted support value spectrum and proposed a sparse least squares support vector classifier, which gradually removes the training samples with the smallest absolute support values and retrains the reduced network. This method was later extended to least squares support vector regression [10]. Meanwhile, weighted LSSVM was presented to improve the robustness of the LSSVM solution [9]. From that point on, sparse algorithms based on the pruning strategy proliferated. Kruif and Vries presented a more sophisticated mechanism for selecting support vectors [11], in which the training sample that introduces the smallest approximation error when omitted is pruned. Hoegaerts et al. [12] compared LSSVM pruning algorithms and concluded that pruning schemes can be divided into QR decomposition and feature vector searching. Instead of determining the pruning points by errors, Zeng and Chen [13] introduced the sequential minimal optimization method to omit the datum that leads to the smallest change in the dual objective function. Based on kernel partial least squares identification, Song and Gui [14] presented a method to obtain base vectors by reducing the kernel matrix via Schmidt orthogonalization.
Generally speaking, the algorithms mentioned above all follow a backward pruning strategy. Correspondingly, methods that iteratively select support vectors forward have recently been used for sparseness. Yu et al. [15] provided a sparse LSSVM based on active learning, which greedily selects the datum with the largest absolute approximation error. Jiao et al. [16] introduced a fast sparse approximation scheme for LSSVM, which picks the sample that contributes most to the objective function. Later, an improved method [17] based on a partial reduction strategy was extended to LSSVR. Subsequently, a recursive reduced least squares support vector machine (RRLSSVM) and an improved version (IRRLSSVM) were proposed [18, 19]. Both choose the support vector that yields the largest reduction of the objective function; the difference is that IRRLSSVM updates the weights of the already selected support vectors during the selection process, while RRLSSVM does not. Additionally, RRLSSVM has been applied to online sparse LSSVM [20].
Backward algorithms have higher computational complexity, since the full-order matrix is gradually decomposed into the submatrix that leads to the minimal increment of the objective function [21]. Forward algorithms, in contrast, need less computation and memory, but their convergence has not been proved. In addition, some methods sparsify LSSVM from other angles, for example, via genetic algorithms [22, 23] or compressive sampling theory [24].
In the aforementioned sparse algorithms, the hyperparameters optimized on the original data set remain fixed throughout greedy learning, and almost all of them employ the radial basis function (RBF) kernel. In other words, greedy learning can be viewed as the parameters selecting the support vectors, because the RBF kernel is local: it measures the Euclidean distance between samples. This suggests another perspective on sparseness. Whatever the algorithm, the initial model on all training data is obtained in advance, and what we ultimately want is to rebuild that model with the fewest support vectors while holding nearly the same approximation accuracy. Hence, we attempt to sparsify LSSVM by reconstructing support vectors that restore the initial model, with the reconstruction strategy tied to the RBF parameters. We propose the reconstructed least squares support vector machine for regression, abbreviated RCLSSVR. The most noticeable innovation is to analyze the features of industrial data sets and apply RCLSSVR to improve sparseness and robustness simultaneously. Industrial data sets exhibit features such as imbalanced distribution and heteroscedasticity, so cutting or adding a datum at different positions changes the sparsification process substantially, which forces iterative algorithms to choose more support vectors in order to stay robust. By reconstructing the support vectors from the initial model and the locations of the original data, the robustness and sparseness problems are solved simultaneously.
This paper is organized as follows. Section 2 briefly introduces the preliminary knowledge, including the fundamentals of LSSVM and the principle of reduced LSSVM. Section 3 describes the characteristics of real industrial data and develops our proposed algorithm. Section 4 reports simulations on function approximation problems and benchmark data sets. Section 5 concludes the paper. To facilitate reading, the necessary abbreviations are listed in the Abbreviations section.
2. Normal LSSVM and the Reduced LSSVM
Given a training data set $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^m$ is the $m$-dimensional input and $y_i \in \mathbb{R}$ is its corresponding target, the goal of function approximation is to find the underlying relation between the input and the target value. Once this relation is found, the outputs corresponding to inputs not contained in the training set can be approximated.
In the LSSVM, the relation underlying the data set is represented as a function of the following form:
(1) $\hat{y}(x) = \omega^T \varphi(x) + b$,
where $\varphi$ is a mapping of the vector $x$ to a high-dimensional feature space, $b$ is the bias, and $\omega$ is a weight vector of the same dimension as the feature space. The mapping $\varphi(x)$ is commonly nonlinear and makes it possible to approximate nonlinear functions. Commonly used mappings result in approximation by radial basis functions, polynomial functions, or linear functions [25].
The approximation error for sample $i$ is defined as
(2) $e_i = y_i - \hat{y}(x_i)$.
In LSSVM, the optimization searches for the weights that give the smallest summed quadratic error over the training samples. The minimization of the error together with regularization is
(3) $\min J(\omega, e) = \frac{1}{2}\omega^T\omega + \frac{\gamma}{2}\sum_{i=1}^{N} e_i^2$
subject to the equality constraints
(4) $y_i = \omega^T \varphi(x_i) + b + e_i, \quad i = 1, \ldots, N$.
Here $\gamma$ is the regularization parameter. This constrained optimization problem is solved by introducing the Lagrangian
(5) $L = J(\omega, e) - \sum_{i=1}^{N} \alpha_i \left(\omega^T \varphi(x_i) + b + e_i - y_i\right)$,
where $\alpha_i \in \mathbb{R}$ is the Lagrange multiplier of $x_i$. The optimum is found by setting the derivatives equal to zero:
(6)
$\partial L / \partial \omega = 0 \rightarrow \omega = \sum_{i=1}^{N} \alpha_i \varphi(x_i)$,
$\partial L / \partial b = 0 \rightarrow \sum_{i=1}^{N} \alpha_i = 0$,
$\partial L / \partial e_i = 0 \rightarrow \alpha_i = \gamma e_i$,
$\partial L / \partial \alpha_i = 0 \rightarrow \omega^T \varphi(x_i) + b + e_i - y_i = 0, \quad i = 1, \ldots, N$.
Eliminating the variables $\omega$ and $e_i$ gives the linear system
(7) $\begin{bmatrix} 0 & \vec{1}^T \\ \vec{1} & \Omega + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix}$,
where $y = [y_1; \ldots; y_N]$, $\vec{1} = [1; \ldots; 1]$, and $\Omega$ is the square matrix with entries $\Omega_{mn} = \varphi(x_m)^T \varphi(x_n) = K(x_m, x_n)$, $m, n = 1, \ldots, N$. Here $K$ is a kernel function satisfying Mercer's condition; it is used to calculate the mapping in input space instead of the feature space. In LSSVM one can use linear, polynomial, and RBF kernels, among others satisfying Mercer's condition. In this paper we focus on the RBF kernel:
(8) $K(x, x_i) = \exp\left(-\dfrac{\|x - x_i\|^2}{\delta^2}\right)$.
Solving this system yields the vector of Lagrange multipliers $\alpha$ and the bias $b$, from which the output can be calculated for new input values $x$:
(9) $y(x) = \sum_{i=1}^{N} \alpha_i K(x_i, x) + b$.
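As a concrete illustration of (7)-(9), the sketch below solves the LSSVM linear system with a dense solver. It is a minimal NumPy transcription for illustration only (the paper's experiments use the LS-SVMlab Matlab toolbox), and the function names are ours:

```python
import numpy as np

def rbf_kernel(A, B, delta):
    """RBF kernel matrix of (8): K(a, b) = exp(-||a - b||^2 / delta^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / delta ** 2)

def lssvm_fit(X, y, gamma, delta):
    """Solve the linear system (7) for the bias b and multipliers alpha."""
    N = len(y)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, delta) + np.eye(N) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]                      # b, alpha

def lssvm_predict(Xnew, Xtrain, alpha, b, delta):
    """Evaluate (9): y(x) = sum_i alpha_i K(x_i, x) + b."""
    return rbf_kernel(Xnew, Xtrain, delta) @ alpha + b
```

Because every training point receives a nonzero multiplier $\alpha_i = \gamma e_i$, every sample appears in (9), which is exactly the lack of sparseness discussed above.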
For sparse LSSVR, a pruning or reduction strategy is applied to the solution by letting $\omega = \sum_{i \in S} \alpha_i k(x_i, \cdot)$, where $S$ is an index subset of $\{1, \ldots, N\}$. Substituting $\omega$ into (3) gives
(10) $\min J(\alpha_S, b) = \frac{1}{2}\alpha_S^T K_{SS} \alpha_S + \frac{\gamma}{2}\sum_{i=1}^{N}\left(y_i - \sum_{j \in S}\alpha_j K(x_i, x_j) - b\right)^2$,
where $K_{SS}$ has entries $K_{ij} = K(x_i, x_j)$, $i, j \in S$, and $\alpha_S$ is the subset of $\alpha$ indexed by $S$. Reformulating (10), up to a constant term, we obtain
(11) $\min J(\alpha_S, b) = \begin{bmatrix} b \\ \alpha_S \end{bmatrix}^T \left(\begin{bmatrix} 0 & 0^T \\ 0 & K_{SS}/\gamma \end{bmatrix} + \begin{bmatrix} 1^T \\ \hat{K}_S \end{bmatrix}\begin{bmatrix} 1 & \hat{K}_S^T \end{bmatrix}\right)\begin{bmatrix} b \\ \alpha_S \end{bmatrix} - 2\left(\begin{bmatrix} 1^T \\ \hat{K}_S \end{bmatrix} y\right)^T \begin{bmatrix} b \\ \alpha_S \end{bmatrix}$;
here $\hat{K}_S$ has entries $K_{ij} = K(x_i, x_j)$, $i \in S$, $j = 1, \ldots, N$, and $0$ and $1$ are vectors of appropriate size. Setting the derivative of (11) with respect to $b$ and $\alpha_S$ to zero gives
(12) $\left(\begin{bmatrix} 0 & 0^T \\ 0 & K_{SS}/\gamma \end{bmatrix} + \begin{bmatrix} 1^T \\ \hat{K}_S \end{bmatrix}\begin{bmatrix} 1 & \hat{K}_S^T \end{bmatrix}\right)\begin{bmatrix} b \\ \alpha_S \end{bmatrix} = \begin{bmatrix} 1^T \\ \hat{K}_S \end{bmatrix} y$.
So the reduced LSSVR is
(13) $y(x) = \sum_{i \in S} \alpha_i K(x_i, x) + b$.
Comparing (9) and (13), only a subset of the training samples contributes to the reduced model rather than every training sample. Therefore, the computational complexity and running time of prediction are decreased.
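The reduced system (12) amounts to a regularized least squares problem in $(b, \alpha_S)$. The sketch below is one way to solve it (our own NumPy formulation under that reading, not the paper's code), with `S` a list of support indices:

```python
import numpy as np

def rbf_kernel(A, B, delta):
    """RBF kernel matrix of (8)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / delta ** 2)

def reduced_lssvm_fit(X, y, S, gamma, delta):
    """Solve the normal equations of (10)/(12) for b and alpha_S."""
    G = rbf_kernel(X, X[S], delta)              # N x |S| block, K(x_i, x_j), j in S
    M = np.hstack([np.ones((len(y), 1)), G])    # design matrix [1, K^_S^T]
    H = M.T @ M
    H[1:, 1:] += rbf_kernel(X[S], X[S], delta) / gamma   # add K_SS / gamma
    z = np.linalg.solve(H, M.T @ y)
    return z[0], z[1:]                          # b, alpha_S

def reduced_lssvm_predict(Xnew, Xsv, alpha_S, b, delta):
    """Evaluate (13) using only the support vectors."""
    return rbf_kernel(Xnew, Xsv, delta) @ alpha_S + b
```

Prediction now touches only the $|S|$ support vectors, which is the source of the speedup noted above.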
3. Reconstructed Least Squares Support Vector Regression
3.1. The Characteristics of Industrial Data Sets
In real industrial processes, various causes give rise to imbalanced data distribution and heteroscedasticity, so both features need to be briefly introduced. The first is imbalanced distribution, which commonly arises in classification domains such as spotting unreliable telecommunication customers, text classification, and detection of fraudulent telephone calls [26, 27]. Solutions at the data and algorithmic levels have been proposed for the class-imbalance problem [28–30]. Imbalance is also a problem in sparse approximation: adding or deleting a support vector changes the estimated performance differently depending on the region, and samples in sparse regions are easily cut off during sparsification. The second feature is heteroscedasticity, that is, nonconstant error variance, which brings severe consequences: the parameter variance is no longer minimal and the prediction precision decreases [31]. Its solutions fall into two categories: data transformation methods and models of heteroscedastic data [32, 33]. In function estimation, the prediction precision is more sensitive to samples in high-variance regions, so the number of support vectors in those regions is usually larger. Finally, the explosion of data, especially of imbalanced and heteroscedastic data, has been plaguing machine learning.
Both forward and backward algorithms have shortcomings on industrial data sets. First, for imbalanced data sets, since the objective function balances the upper bound of the maximum point-to-hyperplane distance against the minimum mean square error, rare samples contribute little to the objective function; data in densely populated regions are thus selected more readily than data in sparse regions. Second, for heteroscedastic data sets, relatively more data are selected or retained in high-variance regions than in low-variance regions. Finally, for excessive data sets, the training time of backward greedy methods becomes unacceptable, and forward greedy algorithms face convergence problems.
3.2. RCLSSVR and DRCLSSVR
To solve the aforementioned problems, we propose an efficient method that reconstructs support vectors to restore the original model. From (8), the common kernel function is local, equivalent to normalizing the Euclidean distance between samples by the kernel parameter. Therefore, it is possible to restore the original model by evenly choosing support vectors near the regression surface and adjusting the parameter $\delta$. However, convergence is not always stable because the data distribution is unknown; this can be solved by selecting the support vectors directly on the original model instead of from the training data set. Thus, from the position information of the training data set and the original model, we rebuild the selected data set $D = \{(x_j, \hat{y}_j)\}_{j=1}^{N}$, where $\hat{y}$ is the estimated value calculated by (9). The reconstructed samples in $D$ can then be selected as support vectors.
The sparse strategy differs from the forward and backward greedy methods and has two steps. The first step selects uniformly spread samples by passing through the entire data set $D$. Arbitrarily pick one point of the original data set as the density center $C = (x_d, y_d)$ and calculate the Euclidean distance vector $Dist$ by (14). Find all samples $M$ within the neighborhood radius $R$ of the density center and update the density center by (15):
(14) $Dist(d, i) = \sqrt{\|x_d - x_i\|^2 + (y_d - y_i)^2}, \quad i \in \{1, \ldots, N\}$,
(15) $C = \arg\min_{j \in I_M} \dfrac{|\hat{y}_j - y_j|}{Dist(d, j)}$,
(16) $M = \{(x_i, y_i) \mid Dist(d, i) \le R, \; i = 1, \ldots, N\}$,
(17) $R = r \cdot Dist\_max$,
where $I_M$ is the index set of $M$, a subset of $\{1, \ldots, N\}$; $|\hat{y}_j - y_j|$ is the absolute error between the predicted and true values; $r$ is the ratio of the maximum distance, deciding how many support samples to select; and $Dist\_max$ is the maximum distance between two samples of the original data set. Thus a datum far from the density center and near the regression surface is selected as the new density center. The selection terminates once the original data set has been traversed. The second step selects, in each iteration, the support vector $(x_d, \hat{y}_d)$ from $D$ corresponding to the density center. Based on the selected support vectors, the regularization and kernel parameters are optimized again by the leave-one-out method. The flow of RCLSSVR is depicted in Algorithm 1.
Algorithm 1 (RCLSSVR).
Input: a training data set $\{(x_i, y_i)\}_{i=1}^{N}$, the radius coefficient $r$, the unlabeled data set $U = \{(x_j, y_j)\}_{j=1}^{N}$, the support vector set $S = \phi$, and the neighborhood data set $M = \phi$.
Output: the support vectors S.
(1) Train the original model by solving (7) on the training data set, where the hyperparameters $(\delta, \gamma)$ are found by 10-fold cross validation;
(2) obtain the function estimates $\{\hat{y}_i\}_{i=1}^{N}$ by (9), and form the reconstructed data set $D = \{(x_i, \hat{y}_i)\}_{i=1}^{N}$;
(3) calculate the Euclidean distance matrix, the maximum distance $Dist\_max$, and the radius $R$;
(4) randomly select a density center $C = (x_d, y_d)$, and add $(x_d, \hat{y}_d)$ from $D$ to $S$;
(5) update the unlabeled data set $U = U - M - C$, where $M$ and $C$ are updated in each iteration;
(6) update the set $M = \{(x_j, y_j) \mid Dist(d, j) \le R, (x_j, y_j) \in U\}$; if $M \ne \phi$, choose the next density center by (15) and add the corresponding datum $(x_d, \hat{y}_d)$ to $S$; otherwise, select the sample $(x_j, y_j) \in U$ with minimum $Dist(d, j)$ as the next density center and add the corresponding datum $(x_d, \hat{y}_d)$ to $S$;
(7) if $U = \phi$, go to the next step; otherwise go to step (5);
(8) optimize the parameters $(\delta, \gamma)$ again by leave-one-out cross validation on the data set $S$, and rebuild the model.
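The density-center traversal of steps (4)-(7) can be sketched as follows. This is our reading of (14)-(17) (in particular, taking (15) to favor samples that are far from the current center and close to the fitted surface), not the authors' code, and the function name is ours:

```python
import numpy as np

def select_support_indices(X, y, y_hat, r=0.2, seed=0):
    """One pass of Algorithm 1's selection loop; returns the indices d
    whose reconstructed pairs (x_d, y_hat_d) become support vectors."""
    Z = np.hstack([X, y[:, None]])                  # points (x_i, y_i)
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    R = r * D.max()                                 # radius R, eq (17)
    err = np.abs(y_hat - y)                         # |y^_j - y_j|
    rng = np.random.default_rng(seed)
    d = int(rng.integers(len(y)))                   # random initial center
    selected = [d]
    U = set(range(len(y))) - {d}
    while U:
        idx = np.array(sorted(U))
        in_ball = idx[D[d, idx] <= R]               # neighborhood M, eq (16)
        U -= set(in_ball.tolist())
        if in_ball.size > 0:
            # eq (15): small error relative to distance -> far from the
            # center and near the fitted surface
            d = int(in_ball[np.argmin(err[in_ball] / D[d, in_ball])])
        elif U:
            idx = np.array(sorted(U))
            d = int(idx[np.argmin(D[d, idx])])      # nearest remaining sample
            U.discard(d)
        else:
            break
        selected.append(d)
    return selected
```

Each iteration removes at least one sample from $U$, so the pass terminates after traversing the whole training set, as step (7) requires.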
Generally speaking, the rate of convergence and the performance of RCLSSVR are bound up with the parameter $r$. A smaller $r$ avoids underfitting but multiplies the number of support vectors, while a larger $r$ reduces the prediction accuracy. Two situations are worth considering after RCLSSVR ends: one is that the prediction precision does not meet the demands; the other is that the set $S$ can still be pruned. Therefore, RCLSSVR admits remedial measures in two directions. The first improves the prediction precision by gradually adding samples from the data set $D$ to the support vectors, selecting the datum whose addition yields the smallest approximation error. The second prunes redundant samples while still meeting the precision requirement. We define the density indicator $Den$ to decide which sample is omitted:
(18) $Den_i = \sum_{m \in s} \exp\left(-\dfrac{\|(x_i, \hat{y}_i) - (x_m, \hat{y}_m)\|^2}{(r/2)^2}\right)$,
where $s$ is the index set of the support vectors $S$ and $r$ takes the same value as the radius coefficient. In the pruning process, the sample with the largest value of $Den$ is pruned. The remedial method is named DRCLSSVR and is described in Algorithm 2.
Algorithm 2 (DRCLSSVR).
(1) Compare the performance with the set value; if it does not meet the set value, go to (5); otherwise go to (2);
(2) calculate the density $Den$ of all support vectors and sort it; remove the sample with the largest value of $Den$;
(3) retrain the LSSVM on the reduced support vectors;
(4) go to (2), unless the performance drops below the set value;
(5) determine the training sample from $D$ whose addition improves the performance most, and add it;
(6) retrain the LSSVM on the added support vectors;
(7) go to (5), unless the performance has reached the set value.
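The pruning criterion in step (2) rests on the density indicator (18). A small sketch of it over the stacked reconstructed pairs $(x_i, \hat{y}_i)$ (the helper name is ours):

```python
import numpy as np

def density_indicator(SV, r=0.2):
    """Den_i of (18): for each support vector (a row of SV), the summed
    RBF similarity to all support vectors with bandwidth r/2. The sample
    with the largest Den sits in the densest region and is pruned first."""
    d2 = ((SV[:, None, :] - SV[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (r / 2) ** 2).sum(axis=1)
```

In each pruning round, `SV[np.argmax(density_indicator(SV))]` is the redundancy candidate removed in step (2).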
4. Experimental Results
To verify the performance of the proposed RCLSSVR and DRCLSSVR, several kinds of experiments are performed. Section 4.1 studies the influence of the parameter $r$ on sparse performance. Section 4.2 compares two backward algorithms and one forward algorithm on synthetic and benchmark data sets. Section 4.3 applies RCLSSVR to the mill load data set. For comparability, all experiments are run on an Intel Core i5-4460 CPU @3.20 GHz with 4.00 GB RAM under Windows 7, in a Matlab 2014a environment with the LS-SVMlab v1.8 toolbox from http://www.esat.kuleuven.be/sista/lssvmlab/. The comparison algorithms are normal LSSVM, Suykens' pruning algorithm (SLSSVM) [10], the backward classic algorithm (PLSSVM) [12], and IRRLSSVM [19]. Among them, PLSSVM is expected to perform best, but it is extremely expensive. The RBF kernel is used in all experiments, and the parameters $(\gamma, \delta)$ are optimized by the leave-one-out cross validation strategy [22]. In addition, two performance indexes, the rooted mean squared error (RMSE) and RMSE%, are defined to evaluate the algorithms:
(19) $\mathrm{RMSE} = \sqrt{\dfrac{\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}{N}}$,
(20) $\mathrm{RMSE\%} = \dfrac{normal\_RMSE}{\mathrm{RMSE}} \times 100$,
where $normal\_RMSE$ is the RMSE of the normal LSSVM.
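The two indexes compute directly from predictions; note that under (20), RMSE% near 100 means the sparse model matches the full LSSVM. A minimal sketch:

```python
import numpy as np

def rmse(y, y_hat):
    """Rooted mean squared error, eq (19)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def rmse_percent(normal_rmse, model_rmse):
    """RMSE% of eq (20): the full-model RMSE over the sparse-model RMSE,
    times 100; larger is better for the sparse model."""
    return normal_rmse / model_rmse * 100.0
```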
4.1. Experiment 1: The Performance with Different $r$
In this subsection, we use the sinc function to investigate the performance of the proposed algorithms with different $r$. Since DRCLSSVR is the supplement of RCLSSVR, the experiment only explores the performance of RCLSSVR with different parameters. It also shows the influence of the parameter when more than one pass is made through the data set; in other words, until a certain number of support vectors is reached, we reset $U = \{(x_j, y_j)\}_{j=1}^{N}$ whenever $U = \phi$. The sinc relation between inputs $X$ and outputs $Y$ is
(21) $Y = \dfrac{10 \sin X}{X} + \varepsilon$,
where $X \in [-10, 10]$, sampled at equal intervals, 300 data in total, and $\varepsilon$ is Gaussian noise with mean 0 and variance 0.5. We randomly select 200 samples as the training data set and the rest as the testing data set.
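The data of (21) can be generated as below; `np.sinc` handles the removable singularity at $X = 0$, and the seed and function name are our additions:

```python
import numpy as np

def make_sinc_data(n=300, noise_std=np.sqrt(0.5), seed=0):
    """Sample (21): Y = 10*sin(X)/X + eps, X equally spaced on [-10, 10],
    eps Gaussian with zero mean (variance 0.5 by default, as in the text)."""
    rng = np.random.default_rng(seed)
    X = np.linspace(-10.0, 10.0, n)
    Y = 10.0 * np.sinc(X / np.pi)       # np.sinc(t) = sin(pi t) / (pi t)
    return X, Y + rng.normal(0.0, noise_std, size=n)
```

A random 200/100 split of these 300 points then gives the training and testing sets used in Figure 1.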
To improve the generalization ability, the attributes of the training data sets are normalized into the closed interval $[0, 1]$ when calculating the Euclidean distance vector $Dist$. The parameter $r$ reflects the proportion of the training data set taken as support vectors and is varied from 0.1 to 0.3 at intervals of 0.05. The simulation of RCLSSVR is plotted in Figure 1.
RMSE% of RCLSSVR with different r: (a) RMSE% on training data set, (b) RMSE% on testing data set.
As can be seen from Figure 1, a larger parameter $r$ leads to faster convergence. When $r$ exceeds 0.25, the performance fluctuates because the algorithm updates the density center by repeatedly passing through the data set. When $r$ is smaller than 0.2, convergence is too slow, since more support vectors are needed to complete a pass through the training data set. The most attractive characteristic is that the performance reaches that of the normal LSSVR once the support vectors reach a certain number. Considering rapid convergence and good stability, the parameter $r$ is set to 0.2 for all training data sets in this paper.
To confirm the effectiveness of DRCLSSVR and to compare the selected #SV of RCLSSVR and DRCLSSVR, the sparse results are displayed in Figure 2. Clearly, DRCLSSVR needs fewer #SV to reach almost the same generalization performance, which also proves that the support vectors of RCLSSVR are redundant.
The comparison between RCLSSVR and DRCLSSVR.
4.2. Experiment 2: The Simulation on Synthetic Data Sets and Benchmark Data Sets
In this subsection, we compare the sparse algorithms mentioned above to show the robustness and sparseness of RCLSSVR and DRCLSSVR. The input and output variables of each data set are normalized into the closed interval $[0, 1]$. We first generate synthetic data sets according to the properties of real data to study the heteroscedastic and imbalanced problems. The heteroscedastic data sets are generated by the sinc function with two kinds of noise: multiplicative noise and heterogeneous Gaussian noise. The imbalanced data set is also generated by (21); the difference is that the numbers of samples in different intervals are not the same. In addition, the testing data set is generated directly by the sinc function. These data sets are illustrated in Figure 3.
Three synthetic data sets: (a) synthetic data set one with multiplicative noise $y = (1 + \nu) \times y$, where $\nu$ is random noise with zero mean and variance 0.1; (b) synthetic data set two with heterogeneous Gaussian noise of zero mean and variance 0.4 and 1.2 on $[-10, 0]$ and $[0, 10]$, respectively; (c) synthetic data set three, an imbalanced data set with about 90% of the data on $[-5, 5]$ and 5% each on $[-10, -5]$ and $[5, 10]$; (d) the testing data set. The numbers of training and testing samples are both 100.
Since the test data come from the noise-free sinc function, the prediction accuracy is more sensitive than on the noisy training data. We gradually increase the support vectors to investigate the robustness of RCLSSVR. The results, shown in Figure 4 and Table 1, illustrate that RCLSSVR generalizes more robustly than the contrast algorithms. IRRLSSVM and SLSSVM suffer on the heteroscedastic data sets but do well on the imbalanced data set. PLSSVM and RCLSSVR obtain very good results on all of these data sets; however, PLSSVM is highly time-consuming, especially for large samples, and serves here only as a reference for sparse performance, not as a candidate algorithm. RCLSSVR has good sparse properties as well as suitable time complexity. Most importantly, RCLSSVR attains the best performance and robustness after passing through the training data sets and is little affected as the support vectors increase.
Table 1: Performance comparison among the algorithms; NLSSVM uses all 100 data points, the others 15 #SV.

| Data set | Synthetic data set one | Synthetic data set two | Synthetic data set three |
| --- | --- | --- | --- |
| NLSSVM | 0.0174 | 0.0049 | 0.0558 |
| RCLSSVR | 0.0183 | 0.0048 | 0.0548 |
| SLSSVM | 10.4719 | 0.0752 | 1.3758 |
| PLSSVM | 0.0162 | 0.0038 | 0.0596 |
| IRRLSSVM | 0.0564 | 0.0088 | 0.0626 |
Generalization performance of RCLSSVR with different data sets: (a) synthetic data set one; (b) synthetic data set two; (c) synthetic data set three.
Next, we conduct experiments on benchmark regression data sets to investigate the sparseness of RCLSSVR and DRCLSSVR. The data sets Chwirut1 and Nelson are downloaded from http://www.itl.nist.gov/div898/strd/nls/nls_main.shtml; Housing, Motorcycle, Airfoil_self_noise, and Yacht_hydrodynamics from http://archive.ics.uci.edu/ml/datasets.html; and MPG, Pyrim, and Bodyfat from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. These data sets also exhibit heteroscedasticity, imbalance, and high dimensionality; their details are listed in Table 2.
Table 2: Detailed information of benchmark data sets.

| Data set | Dimensionality | Training data | Testing data |
| --- | --- | --- | --- |
| Chwirut1 | 1 | 150 | 64 |
| Nelson | 2 | 90 | 38 |
| Boston Housing | 13 | 400 | 106 |
| Bodyfat | 14 | 180 | 71 |
| Pyrim | 28 | 60 | 14 |
| Yacht_hydrodynamics | 6 | 210 | 99 |
| Airfoil_self_noise | 5 | 1200 | 303 |
| Motorcycle | 1 | 100 | 33 |
| MPG | 7 | 300 | 92 |
The comparison results for these benchmark data sets are tabulated in Table 3, where the #SV column gives the number of support vectors (for NLSSVM, the number of training samples). Because the performance approaches that of normal LSSVM as #SV increases, we compare the number of support vectors when the performance reaches 90% of the normal LSSVM: forward algorithms stop when the performance exceeds 90%, and for backward algorithms training terminates once the performance falls below 90%. The RMSE is measured on the testing data set; the parameters $\gamma_1$ and $\delta_1$ are optimized on the original training data set, and $\gamma_2$ and $\delta_2$ are the final optimization results of RCLSSVR. From Table 3, the proposed algorithms show better sparseness and robustness than the compared methods. As expected, PLSSVM outperforms SLSSVM on RMSE, but its training time is too expensive. SLSSVM is relatively unfavorable and performs worse in high dimensions. IRRLSSVM may suffer on scattered and high-dimensional data sets and is more susceptible to parameter effects, though its testing time is lower than the others for the same #SV since it does not need to rebuild the model. Obviously, the fewer the #SV, the shorter the testing time, while different algorithms have different training times. In practical applications, both should stay within the allowable range, because the sparse algorithms are trained offline and used online.
Table 3: Experimental results on benchmark data sets.

Chwirut1 ($\gamma_1 = 5.1539$, $\delta_1^2 = 0.0036$; $\gamma_2 = 3.0\mathrm{e}{+}06$, $\delta_2^2 = 0.3$)

| Algorithm | RMSE | #SV | Training time (s) | Testing time (s) |
| --- | --- | --- | --- | --- |
| NLSSVM | 0.014 ± 0.001 | 150 | 0.043 ± 0.009 | 0.023 ± 0.004 |
| RCLSSVR | 0.015 ± 0.002 | 12 | 0.241 ± 0.072 | 0.001 ± 0.000 |
| DRCLSSVR | 0.016 ± 0.001 | 9 | 0.250 ± 0.076 | 0.001 ± 0.000 |
| SLSSVM | 0.017 ± 0.002 | 42 | 0.484 ± 0.051 | 0.001 ± 0.000 |
| PLSSVM | 0.017 ± 0.000 | 12 | 43.094 ± 1.377 | 0.001 ± 0.000 |
| IRRLSSVM | 0.016 ± 0.002 | 16 | 0.313 ± 0.045 | 0.001 ± 0.000 |

Nelson ($\gamma_1 = 13.145$, $\delta_1^2 = 0.100$; $\gamma_2 = 53.739$, $\delta_2^2 = 0.013$)

| Algorithm | RMSE | #SV | Training time (s) | Testing time (s) |
| --- | --- | --- | --- | --- |
| NLSSVM | 0.016 ± 0.001 | 90 | 0.020 ± 0.006 | 0.005 ± 0.004 |
| RCLSSVR | 0.016 ± 0.001 | 16 | 0.376 ± 0.014 | 0.001 ± 0.000 |
| DRCLSSVR | 0.017 ± 0.002 | 11 | 0.412 ± 0.036 | 0.001 ± 0.000 |
| SLSSVM | 0.019 ± 0.005 | 50 | 0.219 ± 0.043 | 0.003 ± 0.001 |
| PLSSVM | 0.019 ± 0.001 | 130 | 12.971 ± 1.364 | 0.002 ± 0.000 |
| IRRLSSVM | 0.018 ± 0.000 | 16 | 0.166 ± 0.006 | 0.001 ± 0.000 |

Boston Housing ($\gamma_1 = 554.003$, $\delta_1^2 = 1.976$; $\gamma_2 = 3.41\mathrm{e}{+}03$, $\delta_2^2 = 2.76$)
4.3. Experiment 3: The Simulation on Mill Load Data Sets
Mill load cannot be measured accurately because of numerous influencing factors, yet it is critical to increasing the efficiency and energy efficiency of the pulverizing system. Paper [34] proposed that the mill load can be represented by mill noise $E_n$, mill vibration $E_v$, mill current $I$, import-export pressure difference $PD$, and export temperature $T$. The estimation of mill load is formulated as
(22) $y_m = f(E_n, E_v, I, PD, T)$.
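Assembling the five signals of (22) into a normalized model input can look like the following sketch; the variable names and the min-max scaling are our assumptions, mirroring the $[0, 1]$ normalization used for the other data sets in this paper:

```python
import numpy as np

def build_mill_features(En, Ev, I, PD, T):
    """Stack mill noise, vibration, current, pressure difference, and
    export temperature column-wise and min-max scale each to [0, 1].
    The resulting rows are the inputs x_i of the regression (22)."""
    X = np.column_stack([En, Ev, I, PD, T]).astype(float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)      # guard constant columns
    return (X - lo) / span
```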
The mill load data set, consisting of 2400 samples, covers three conditions: low load, normal load, and high load. The training and testing sets consist of 600 and 200 randomly selected samples, respectively, and the model parameters are optimized by 10-fold cross validation. The comparison results of the sparse algorithms, shown in Figure 5, are consistent with the results on the benchmark data sets. RCLSSVR needs fewer #SV to reach the normal performance, with a shorter testing time in the prediction phase. The estimation results of RCLSSVR and DRCLSSVR are illustrated in Figure 6; the testing data set contains different wear conditions, and the model built by RCLSSVR estimates well in all of them. Evidently, RCLSSVR addresses both sparseness and robustness in the training process and is well suited to real industrial data sets. The experimental details are tabulated in Table 4.
Table 4: Experimental results on the mill load data set.

| Algorithm | #SV | RMSE | Training time (s) | Testing time (s) |
| --- | --- | --- | --- | --- |
| NLSSVM | 600 | 2.4014e-04 | 1.1820 | 0.1080 |
| RCLSSVR | 70 | 2.4651e-04 | 4.3930 | 0.0083 |
| DRCLSSVR | 36 | 2.5139e-04 | 5.1670 | 0.0066 |
| SLSSVM | 230 | 2.4897e-04 | 9.2530 | 0.0265 |
| PLSSVM | 130 | 2.4724e-04 | 3.5078e+03 | 0.0137 |
| IRRLSSVM | 140 | 6.4476e-04 | 17.0970 | 0.0161 |
RMSE% against #SV on mill load data set.
The estimation output of testing data set.
5. Conclusion
Aiming at the sparseness and robustness problems of LSSVM, this paper proposes two sparse algorithms, RCLSSVR and DRCLSSVR, which reconstruct support vectors from the training data set and the target function. RCLSSVR selects reconstructed data according to the location information of density clustering on the training data set and retunes the hyperparameters over the selected support vectors during training. DRCLSSVR is proposed to remove the redundant support vectors of RCLSSVR. To demonstrate their effectiveness, several function approximation experiments are carried out on the sinc data set and benchmark data sets. The experimental results demonstrate that the proposed methods are superior to the compared algorithms, both in RMSE precision and in the number of support vectors. Finally, RCLSSVR and DRCLSSVR are applied to mill load prediction and achieve good sparseness and robustness.
Abbreviations
LSSVM: Least squares support vector machine
RCLSSVR: Reconstructed LSSVM
DRCLSSVR: The improved RCLSSVR
SLSSVM: Suykens pruning algorithm [10]
PLSSVM: Backward classic algorithm [12]
IRRLSSVM: Forward sparse algorithm [19]
RBF: Radial basis function
RMSE: Rooted mean squared error
RMSE%: The ratio of RMSE
{x_i, y_i}_{i=1}^{N}: The training data set
{x_i, ŷ_i}_{i=1}^{N}: The reconstructed data set
{x_d, y_d}: The density center
{x_d, ŷ_d}: The support vector
(δ, γ): The RBF hyperparameters
Dist: The Euclidean distance vector
Den: The density indicators for support vectors
r: The ratio of maximum distance
#SV: The number of support vectors.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work is supported by the National Natural Science Foundation of China (Grant no. 61304118), the Program for New Century Excellent Talents in University (NCET-13-0456), and the Specialized Research Fund for the Doctoral Program of Higher Education of China (Grant no. 20130201120011).
References
[1] Huang P., Jia M.-P., Zhong B.-L. Investigation on measuring the fill level of an industrial ball mill based on the vibration characteristics of the mill shell.
[2] Tang J., Yu W., Chai T., Liu Z., Zhou X. Selective ensemble modeling load parameters of ball mill based on multi-scale frequency spectral features and sphere criterion.
[3] Si G., Cao H., Zhang Y., Jia L. Experimental investigation of load behaviour of an industrial scale tumbling mill using noise and vibration signature techniques.
[4] Tang J., Chai T., Yu W., Liu Z., Zhou X. A comparative study that measures ball mill load parameters through different single-scale and multiscale frequency spectra-based approaches.
[5] Rutherford S. J., Cole D. J. Modelling nonlinear vehicle dynamics with neural networks.
[6] Bengio Y., Chapados N., Delalleau O., Larochelle H., Saint-Mleux X., Hudon C., Louradour J. Detonation classification from acoustic signature with the restricted Boltzmann machine.
[7] Li Q.-B., Yan H.-L., Li L.-N., Wu J.-G., Zhang G.-J. Application of partial robust M-regression in noninvasive measurement of human blood glucose concentration with near-infrared spectroscopy.
[8] Chen L., Liu H. Application of LS-SVM in fault diagnosis for diesel generator set of marine power station. In: Proceedings of the 2013 International Conference on Advanced Computer Science and Electronics Information (ICACSEI 2013), vol. 41, 2013, pp. 101–104.
[9] Suykens J. A. K., De Brabanter J., Lukas L., Vandewalle J. Weighted least squares support vector machines: robustness and sparse approximation.
[10] Suykens J. A. K., Lukas L., Vandewalle J. Sparse approximation using least squares support vector machines. In: Proceedings of the 2000 IEEE International Symposium on Circuits and Systems (ISCAS '00), vol. 2, Geneva, Switzerland, May 2000, pp. 757–760.
[11] de Kruif B. J., de Vries T. J. A. Pruning error minimization in least squares support vector machines.
[12] Hoegaerts L., Suykens J. A. K., Vandewalle J., De Moor B. A comparison of pruning algorithms for sparse least squares support vector machines. In: Proceedings of the 11th International Conference on Neural Information Processing (ICONIP '04), Lecture Notes in Computer Science, vol. 3316, Calcutta, India, January 2004, Springer, pp. 1247–1253. doi:10.1007/978-3-540-30499-9_194.
[13] Zeng X. Y., Chen X. W. SMO-based pruning methods for sparse least squares support vector machines.
[14] Song H. Y., Gui W. H., Yang C. H. Sparse least squares support vector machine and its applications.
[15] Yu Z.-T., Zou J.-J., Zhao X., Su L., Mao C.-L. Sparseness of least squares support vector machines based on active learning.
[16] Jiao L. C., Bo L. F., Wang L. Fast sparse approximation for least squares support vector machine.
[17] Yongping Z., Jianguo S. Fast method for sparse least squares support vector regression machine.
[18] Zhao Y., Sun J. Recursive reduced least squares support vector regression.
[19] Zhao Y.-P., Sun J.-G., Du Z.-H., Zhang Z.-A., Zhang Y.-C., Zhang H.-B. An improved recursive reduced least squares support vector regression.
[20] Sun L. G., de Visser C. C., Chu Q. P., Mulder J. A. A novel online adaptive kernel method with kernel centers determined by a support vector regression approach.
[21] Nair P. B., Choudhury A., Keane A. J. Some greedy learning algorithms for sparse regression and classification with Mercer kernels.
[22] Silva J. P., Da Rocha Neto A. R. Sparse least squares support vector machines via genetic algorithms. In: Proceedings of the 11th Brazilian Congress on Computational Intelligence (BRICS '13), Ipojuca, Brazil, September 2013, IEEE, pp. 248–253. doi:10.1109/BRICS-CCI-CBIC.2013.48.
[23] Silva D. A., Silva J. P., Rocha Neto A. R. Novel approaches using evolutionary computation for sparse least square support vector machines.
[24] Yang L., Yang S., Zhang R., Jin H. Sparse least square support vector machine via coupled compressive pruning.
[25] Ying Z., Keong K. C. Fast leave-one-out evaluation and improvement on inference for LS-SVMs. In: Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), vol. 3, Cambridge, UK, August 2004, pp. 494–497. doi:10.1109/ICPR.2004.1334574.
[26] Cao P., Liu X., Zhang J. ℓ2,1 norm regularized multi-kernel based joint nonlinear feature selection and over-sampling for imbalanced data classification.
[27] Zhang J., Wu X., Sheng V. S. Imbalanced multiple noisy labeling.
[28] Yan R., Liu Y., Jin R., Hauptmann A. On predicting rare classes with SVM ensembles in scene classification. In: Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), Hong Kong, China, April 2003.
[29] Yen S.-J., Lee Y.-S.
[30] Das B., Krishnan N. C., Cook D. J. RACOG and wRACOG: two probabilistic oversampling techniques.
[31] Wong H., Liu F., Chen M., Ip W. C. Empirical likelihood based diagnostics for heteroscedasticity in partially linear errors-in-variables models.
[32] Lejeune B. A diagnostic m-test for distributional specification of parametric conditional heteroscedasticity models for financial data.
[33] Ginker T., Lieberman O. Robustness of binary choice models to conditional heteroscedasticity.
[34] Si G. Q., Cao H., Zhang Y. B., Jia L. X. Density weighted pruning method for sparse least squares support vector machines.