Least Absolute Deviation Support Vector Regression

Least squares support vector machine (LS-SVM) is a powerful tool for pattern classification and regression estimation. However, LS-SVM is sensitive to large noise and outliers, since it employs the squared loss function. To address this problem, in this paper we propose an absolute deviation loss function to reduce the effects of outliers and derive a robust regression model termed least absolute deviation support vector regression (LAD-SVR). The proposed loss function is not differentiable; we approximate it by constructing a smooth function and develop a Newton algorithm to solve the robust model. Numerical experiments on both artificial datasets and benchmark datasets demonstrate the robustness and effectiveness of the proposed method.


Introduction
Support vector machine (SVM), introduced by Vapnik [1] and Cristianini and Shawe-Taylor [2], has been gaining more and more popularity over the past decades as a modern machine learning approach with a strong theoretical foundation and successes in many real-world applications. However, its training computational load is great, namely $O(n^3)$, where $n$ is the total number of training samples. In order to reduce the computational effort, many accelerating algorithms have been proposed. Traditionally, SVM is trained by means of decomposition techniques such as SMO [3, 4], chunking [5], SVMlight [6], and LIBSVM [7], which solve the dual problems by optimizing a small subset of the variables during the iteration procedure. Another kind of accelerating algorithm is the least squares SVM introduced by Suykens and Vandewalle [8], which replaces the inequality constraints with equality ones, so that only a linear system of equations has to be solved, resulting in an extremely fast training speed.
LS-SVM obtains good performance on various classification and regression estimation problems. LS-SVR is optimal when the error variables follow a Gaussian distribution, because it minimizes the sum of squared errors (SSE) over the training samples [9]. However, datasets subject to heavy-tailed errors or outliers are commonly encountered in applications, and the solution of LS-SVR may then lack robustness. In recent years, much effort has been made to increase the robustness of LS-SVR. A commonly used approach adopts weight-setting strategies to reduce the influence of outliers [9-13]. In these LS-SVR methods, different weight factors are put on the error variables such that the less important samples or outliers receive smaller weights. Another approach improves the performance of LS-SVR by means of outlier elimination [14-17]. Essentially, LS-SVR is sensitive to outliers because it employs the squared loss function, which overemphasizes the impact of outliers.
In this paper, we focus on the situation in which heavy-tailed errors or outliers are found in the targets. In such a situation, it is well known that traditional least squares (LS) may fail to produce a reliable regressor, whereas least absolute deviation (LAD) can be very useful [18-20]. Therefore, we exploit the absolute deviation loss function to reduce the effects of outliers and derive a robust regression model termed least absolute deviation SVR (LAD-SVR). Because the absolute deviation loss function is not differentiable, classical optimization methods cannot be used directly to solve LAD-SVR. Recently, several algorithms for training SVMs in the primal space have been proposed owing to their computational effectiveness. Moreover, it has been pointed out that primal-domain methods are superior to dual-domain methods when the goal is to find an approximate solution [21, 22]. Therefore, we approximate the loss of LAD-SVR by constructing a smooth function and develop a Newton algorithm to solve the robust model in the primal space. Numerical experiments on both artificial datasets and benchmark datasets reveal the efficiency of the proposed method.
The paper is organized as follows. In Section 2, we briefly introduce classical LS-SVR and LS-SVR in the primal space. In Section 3, we propose an absolute deviation loss function and derive LAD-SVR. A Newton algorithm for LAD-SVR is given in Section 4. Section 5 reports experiments on artificial datasets and benchmark datasets to investigate the effectiveness of LAD-SVR. In Section 6, conclusions are drawn.

Least Squares Support Vector Regression
2.1. Classical LS-SVR. In this section, we concisely present the basic principles of LS-SVR. For more details, the reader can refer to [8, 9]. Consider a regression problem with a training dataset $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$ is the input variable and $y_i \in \mathbb{R}$ is the corresponding target. To derive a nonlinear regressor, LS-SVR solves the following optimization problem:

$$\min_{w,\, b,\, \xi} \ \frac{1}{2}\|w\|^2 + \frac{C}{2}\sum_{i=1}^{n}\xi_i^2 \quad \text{s.t.} \quad y_i = w^\top \varphi(x_i) + b + \xi_i, \ i = 1, \ldots, n, \qquad (1)$$

where $\xi_i$ represents the error variables, $\|w\|^2$ controls the model complexity, $\varphi(\cdot)$ is a nonlinear mapping which maps the input data into a high-dimensional feature space, and $C > 0$ is the regularization parameter that balances the model complexity and the empirical risk. To solve (1), we introduce Lagrange multipliers and construct a Lagrangian function. Utilizing the Karush-Kuhn-Tucker (KKT) conditions, we get the dual optimization problem, a linear system of equations:

$$\begin{pmatrix} 0 & e^\top \\ e & K + I/C \end{pmatrix} \begin{pmatrix} b \\ \alpha \end{pmatrix} = \begin{pmatrix} 0 \\ \mathbf{y} \end{pmatrix}, \qquad (2)$$

where $e = (1, 1, \ldots, 1)^\top$, $\mathbf{y} = (y_1, y_2, \ldots, y_n)^\top$, $I$ denotes the $n \times n$ identity matrix, $K = (K_{ij})_{n \times n}$ is the kernel matrix with $K_{ij} = k(x_i, x_j) = \varphi(x_i)^\top \varphi(x_j)$, and $k(\cdot, \cdot)$ is the kernel function. By solving (2), the regressor can be obtained as

$$f(x) = \sum_{i=1}^{n} \alpha_i k(x_i, x) + b. \qquad (3)$$

2.2. LS-SVR in the Primal Space. In this section, we describe LS-SVR solved in the primal space, following the growing interest in training SVMs in the primal space in recent years [21, 22]. Primal optimization of an SVM has strong similarities with the dual strategy [21] and can be implemented with widely used optimization techniques.
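As a concrete illustration of the classical formulation in Section 2.1, the linear system (2) and the regressor (3) can be sketched in a few lines of NumPy. This is a minimal sketch, not the implementation used in the paper; the Gaussian kernel, all parameter values, and all function names are our own choices.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    # K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def lssvr_fit(X, y, C=10.0, sigma=1.0):
    n = len(y)
    K = gaussian_kernel(X, X, sigma)
    # Assemble the (n+1) x (n+1) KKT linear system of (2).
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0                     # e^T
    A[1:, 0] = 1.0                     # e
    A[1:, 1:] = K + np.eye(n) / C
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, rhs)
    return sol[1:], sol[0]             # alpha, b

def lssvr_predict(alpha, b, X_train, X_test, sigma=1.0):
    # Regressor (3): f(x) = sum_i alpha_i k(x_i, x) + b
    return gaussian_kernel(X_test, X_train, sigma) @ alpha + b
```

For small $n$ a dense solve is adequate; the first row of the system enforces $e^\top \alpha = 0$, which can serve as a quick sanity check on the solution.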
The optimization problem (1) of LS-SVR can be described (up to a rescaling of $C$) as

$$\min_{w,\, b} \ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n} L_1(\xi_i), \qquad (4)$$

where $L_1(\xi) = \xi^2$, with $\xi = y - f(x)$, is the squared loss function shown in Figure 1. In the reproducing kernel Hilbert space $\mathcal{H}$, we rewrite the optimization problem (4) as

$$\min_{f \in \mathcal{H}} \ \frac{1}{2}\|f\|_{\mathcal{H}}^2 + C\sum_{i=1}^{n} L_1\bigl(y_i - f(x_i)\bigr). \qquad (5)$$

For the sake of simplicity, we drop the bias $b$, without loss of generalization performance of SVR [21]. According to [21], the optimal function for (5) can be expressed as a linear combination of the kernel functions centered at the training samples:

$$f(x) = \sum_{i=1}^{n} \beta_i k(x_i, x). \qquad (6)$$

Substituting (6) into (5), we have

$$\min_{\beta} \ \frac{1}{2}\beta^\top K \beta + C\sum_{i=1}^{n} L_1(y_i - K_i \beta), \qquad (7)$$

where $\beta = (\beta_1, \beta_2, \ldots, \beta_n)^\top$ and $K_i$ is the $i$th row of the kernel matrix $K$.
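For the squared loss, the primal problem in $\beta$ can even be solved in closed form, which gives a compact baseline for the iterative method of Section 4. A minimal sketch, assuming a Gaussian kernel and the $C/2$ scaling of the loss (so that setting the gradient of the objective to zero yields the linear system $(I + CK)\beta = C\mathbf{y}$ for positive semidefinite $K$); the function names are ours:

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def primal_lssvr_fit(X, y, C=10.0, sigma=1.0):
    # Minimizes 0.5 * beta^T K beta + (C/2) * ||y - K beta||^2,
    # a C/2-scaled variant of (7). Its gradient is
    #   K beta - C K (y - K beta) = K [ (I + C K) beta - C y ],
    # so the solution of (I + C K) beta = C y is a minimizer.
    n = len(y)
    K = gaussian_kernel(X, X, sigma)
    beta = np.linalg.solve(np.eye(n) + C * K, C * y)
    return beta, K
```

The predicted values on the training set are simply `K @ beta`, matching expansion (6).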

Least Absolute Deviation SVR
As mentioned, with the squared loss function $L_1(\xi) = \xi^2$, LS-SVR is sensitive to outliers and noise. When there exist outliers far away from the rest of the samples, the large errors dominate the SSE, and the regression function of LS-SVR severely deviates from its original position, deteriorating the performance of LS-SVR.
In this section, we propose an absolute deviation loss function $L_2(\xi) = |\xi|$ to reduce the influence of outliers. The two losses are graphically depicted in Figure 1, which shows the squared loss function $L_1(\xi) = \xi^2$ and the absolute deviation loss $L_2(\xi) = |\xi|$, respectively. From the figure, the exaggerating effect of $L_1(\xi) = \xi^2$ at points with large errors, compared with $L_2(\xi) = |\xi|$, is evident.
The robust LAD-SVR model can be constructed as

$$\min_{\beta} \ \frac{1}{2}\beta^\top K \beta + C\sum_{i=1}^{n} L_2(y_i - K_i \beta). \qquad (8)$$

However, $L_2(\xi)$ is not differentiable, and the associated optimization problem is difficult to solve. Inspired by the Huber loss function [23], we propose the following smoothed loss function:

$$L_3(\xi) = \begin{cases} \dfrac{\xi^2}{2h}, & |\xi| \le h, \\[2mm] |\xi| - \dfrac{h}{2}, & |\xi| > h, \end{cases} \qquad (9)$$

where $h > 0$ is the Huber parameter; its shape is shown in Figure 1. It is easy to verify that $L_3(\xi)$ is differentiable, and for $h \to 0$, $L_3(\xi)$ approaches $L_2(\xi)$. Replacing $L_2(\xi)$ with $L_3(\xi)$ in (8), we obtain

$$\min_{\beta} \ L(\beta) = \frac{1}{2}\beta^\top K \beta + C\sum_{i=1}^{n} L_3(y_i - K_i \beta). \qquad (10)$$
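The three losses above can be sketched directly; the following snippet is an illustration in NumPy (the function names are ours). It also makes the two key properties of (9) easy to check numerically: both pieces of $L_3$ agree at $|\xi| = h$, and $L_3 \to L_2$ as $h \to 0$.

```python
import numpy as np

def L1(xi):
    # squared loss of LS-SVR
    return xi ** 2

def L2(xi):
    # absolute deviation loss of (8)
    return np.abs(xi)

def L3(xi, h):
    # Huber-type smoothing of L2, cf. (9)
    xi = np.asarray(xi, dtype=float)
    return np.where(np.abs(xi) <= h,
                    xi ** 2 / (2 * h),   # quadratic piece on |xi| <= h
                    np.abs(xi) - h / 2)  # linear piece on |xi| > h
```

At the junction $|\xi| = h$ both branches equal $h/2$ and both derivatives equal $\operatorname{sgn}(\xi)$, so $L_3$ is continuously differentiable.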

Newton Algorithm for LAD-SVR
Noticing that the objective function $L(\beta)$ of (10) is continuous and differentiable, (10) can be solved by a Newton algorithm. At the $k$th iteration, with residuals $\xi_i = y_i - K_i \beta^k$, we divide the training samples into two groups according to $|\xi_i| \le h$ and $|\xi_i| > h$. Let $S_1 = \{i \mid |\xi_i| \le h\}$ denote the index set of samples lying in the quadratic part of $L_3(\xi)$ and $S_2 = \{i \mid |\xi_i| > h\}$ the index set of samples lying in the linear part of $L_3(\xi)$; $|S_1|$ and $|S_2|$ represent the numbers of samples in $S_1$ and $S_2$. For the sake of clarity, we suppose that the samples are arranged in the order of $S_1$ and $S_2$. Furthermore, we define $n \times n$ diagonal matrices $P_1$ and $P_2$, where $P_1$ has the first $|S_1|$ diagonal entries equal to 1 and the others 0, and $P_2$ has the diagonal entries from $|S_1|+1$ to $|S_1|+|S_2|$ equal to 1 and the others 0. Then, we develop a Newton algorithm for (10). The gradient is

$$\nabla L(\beta) = K\beta - \frac{C}{h}\, K P_1 (\mathbf{y} - K\beta) - C\, K P_2 \mathbf{s}, \qquad (11)$$

where $\mathbf{y} = (y_1, \ldots, y_n)^\top$ and $\mathbf{s} = (s_1, \ldots, s_n)^\top$ with $s_i = \operatorname{sgn}(\xi_i)$. The Hessian matrix at the $k$th iteration is

$$H_k = K + \frac{C}{h}\, K P_1 K = K \Bigl(I + \frac{C}{h} P_1 K\Bigr). \qquad (12)$$

The Newton step at the $(k+1)$th iteration is

$$\beta^{k+1} = \beta^k - H_k^{-1} \nabla L(\beta^k). \qquad (13)$$

The inverse of $M = I + (C/h) P_1 K$ can be calculated efficiently: with the samples ordered as above, $M$ is block upper triangular,

$$M = \begin{pmatrix} I_{|S_1|} + (C/h) K_{11} & (C/h) K_{12} \\ 0 & I_{|S_2|} \end{pmatrix}, \qquad (14)$$

where $K_{11}$ and $K_{12}$ are the corresponding blocks of $K$, so inverting $M$ only requires inverting the $|S_1| \times |S_1|$ block $I_{|S_1|} + (C/h) K_{11}$. The computational complexity of computing $M^{-1}$ is therefore $O(|S_1|^3)$. Substituting (14) into (13) and simplifying (for nonsingular $K$), we obtain

$$\beta^{k+1} = M^{-1} \Bigl(\frac{C}{h} P_1 \mathbf{y} + C P_2 \mathbf{s}\Bigr). \qquad (15)$$

Having updated $\beta^{k+1}$, we get the corresponding regressor

$$f(x) = \sum_{i=1}^{n} \beta_i^{k+1} k(x_i, x). \qquad (16)$$

The procedure for implementing LAD-SVR can be summarized as follows: choose a starting point $\beta^0$; at each iteration, regroup the samples into $S_1$ and $S_2$ and apply the update (15); stop when $\beta^{k+1}$ no longer changes; and output the regressor (16).

Experiments
In order to test the effectiveness of the proposed LAD-SVR, we conduct experiments on several datasets, including six artificial datasets and nine benchmark datasets, and compare it with LS-SVR. The Gaussian kernel is selected as the kernel function in all experiments. All experiments are carried out on an Intel Pentium IV 3.00 GHz PC with 2 GB of RAM using Matlab 7.0 under Microsoft Windows XP. The linear system of equations in LS-SVR is solved by the Matlab operation "\". Parameter selection is a crucial issue for modeling with kernel methods, because improper parameters, such as the regularization parameter $C$ and the kernel parameter $\sigma$, severely affect the generalization performance of SVR. Grid search [2] is a simple and direct method, which conducts an exhaustive search over the parameter space to minimize the validation error. In this paper, we employ grid search to find the optimal parameters, that is, those achieving the best performance on the test samples.
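A grid search of the kind described above is straightforward to sketch. The snippet below is a generic illustration (the candidate grids, the use of validation RMSE, and all names are our own choices, not the paper's settings):

```python
import numpy as np
from itertools import product

def grid_search(fit, predict, X_tr, y_tr, X_va, y_va,
                Cs=(0.1, 1, 10, 100), sigmas=(0.25, 0.5, 1, 2)):
    """Exhaustive search over (C, sigma) minimizing validation RMSE.

    `fit(X, y, C=..., sigma=...)` returns a model;
    `predict(model, X)` returns predictions."""
    best = (np.inf, None)
    for C, sigma in product(Cs, sigmas):
        model = fit(X_tr, y_tr, C=C, sigma=sigma)
        rmse = np.sqrt(np.mean((predict(model, X_va) - y_va) ** 2))
        if rmse < best[0]:
            best = (rmse, (C, sigma))
    return best[1]
```

In practice the grids are often taken over exponentially spaced values (e.g., powers of 2), and the search cost grows with the product of the grid sizes.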
To evaluate the performance of the algorithms, we adopt the following four popular regression estimation criteria: root mean square error (RMSE) [24], mean absolute error (MAE), the ratio between the sum of squared errors (SSE) and the sum of squared deviations of the testing samples (SST), denoted SSE/SST [25], and the ratio between the interpretable sum of squared deviations (SSR) and SST, denoted SSR/SST [25]. These criteria are defined as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}(y_i - \hat{y}_i)^2}, \qquad \mathrm{MAE} = \frac{1}{m}\sum_{i=1}^{m}\bigl|y_i - \hat{y}_i\bigr|,$$

$$\mathrm{SSE} = \sum_{i=1}^{m}(y_i - \hat{y}_i)^2, \qquad \mathrm{SST} = \sum_{i=1}^{m}(y_i - \bar{y})^2, \qquad \mathrm{SSR} = \sum_{i=1}^{m}(\hat{y}_i - \bar{y})^2,$$

where $m$ is the number of testing samples, $y_i$ denotes the target, $\hat{y}_i$ is the corresponding prediction, and $\bar{y} = (1/m)\sum_{i=1}^{m} y_i$. RMSE is commonly used as the deviation measurement between the real and predicted values and represents the fitting precision: the smaller the RMSE, the better the fitting performance. However, when noisy samples are used as testing samples, a very small RMSE probably indicates overfitting of the regressor. MAE is another popular deviation measurement between the real and predicted values. In most cases, a small SSE/SST indicates good agreement between the estimates and the real values, and a smaller SSE/SST is usually accompanied by an increase in SSR/SST. However, an extremely small SSE/SST is in fact not good, since it probably indicates overfitting of the regressor. Therefore, a good estimator should strike a balance between SSE/SST and SSR/SST.
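The four criteria above translate directly into code; the following sketch (function name ours) computes all of them in one pass:

```python
import numpy as np

def criteria(y, yhat):
    """Return RMSE, MAE, SSE/SST and SSR/SST for targets y and predictions yhat."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    ybar = y.mean()
    rmse = np.sqrt(np.mean((y - yhat) ** 2))
    mae = np.mean(np.abs(y - yhat))
    sse = np.sum((y - yhat) ** 2)        # sum of squared errors
    sst = np.sum((y - ybar) ** 2)        # total squared deviation
    ssr = np.sum((yhat - ybar) ** 2)     # interpretable squared deviation
    return rmse, mae, sse / sst, ssr / sst
```

For a perfect prediction, RMSE, MAE, and SSE/SST are zero while SSR/SST equals one, which matches the intuition that SSE/SST measures error and SSR/SST measures explained variation.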
Six types of noise (types I-VI) are added to the Sinc function to generate the artificial datasets, where $N(0, \sigma^2)$ represents the Gaussian random variable with zero mean and variance $\sigma^2$, $U[a, b]$ denotes the uniformly distributed random variable on $[a, b]$, and $t(k)$ denotes the Student's $t$ random variable with $k$ degrees of freedom.
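As an illustration of this setup, the following sketch generates one group of noisy Sinc samples. It is a stand-in, not the paper's exact protocol: Gaussian noise replaces the six specific noise models, and the interval, noise levels, and function names are our assumptions; only the sample sizes and the 1/5 outlier fraction follow the description below.

```python
import numpy as np

def make_sinc_data(n_train=350, n_test=500, noise_std=0.1,
                   outlier_frac=0.2, outlier_scale=2.0, seed=0):
    rng = np.random.default_rng(seed)
    X_tr = rng.uniform(-4 * np.pi, 4 * np.pi, size=(n_train, 1))
    # np.sinc(x) = sin(pi x) / (pi x), so sinc(x / pi) = sin(x) / x
    y_tr = np.sinc(X_tr[:, 0] / np.pi) + rng.normal(0, noise_std, n_train)
    # corrupt 1/5 of the training targets with large noise (outliers)
    idx = rng.choice(n_train, int(outlier_frac * n_train), replace=False)
    y_tr[idx] += rng.normal(0, outlier_scale, len(idx))
    # noise-free test samples from the objective Sinc function
    X_te = np.linspace(-4 * np.pi, 4 * np.pi, n_test).reshape(-1, 1)
    y_te = np.sinc(X_te[:, 0] / np.pi)
    return X_tr, y_tr, X_te, y_te
```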
In order to avoid biased comparisons, for each kind of noise we randomly generate ten independent groups of noisy samples, each consisting of 350 training samples and 500 testing samples. For each training dataset, we randomly choose 1/5 of the samples and add large noise to their targets to simulate outliers. The testing samples are drawn uniformly from the objective Sinc function without any noise. Table 1 shows the average accuracies of LS-SVR and LAD-SVR over ten independent runs. From Table 1, we can see that LAD-SVR has advantages over LS-SVR for all types of noise in terms of RMSE, MAE, and SSE/SST. Hence, LAD-SVR is robust to noise and outliers. Moreover, LAD-SVR derives larger SSR/SST values for three types of noise (types II, IV, and V). From Figure 2, we can see that LAD-SVR follows the actual data more closely than LS-SVR for most of the testing samples. The main reason is that LAD-SVR employs an absolute deviation loss function, which reduces the penalty on outliers in the training process. The histograms of the distributions of the error variables $\xi_i$ of LS-SVR and LAD-SVR for the different types of noise are shown in Figure 3. We notice that, compared with LS-SVR, the histograms of LAD-SVR for all types of noise are closer to a Gaussian distribution. Therefore, the proposed LAD-SVR derives a better approximation than LS-SVR.

Experiments on Benchmark Datasets
We download nine benchmark datasets (e.g., MCPU) from the web page (http://www.dcc.fc.up.pt/∼ltorgo/Regression/DataSets.html); these datasets are widely used in evaluating regression algorithms. The detailed descriptions of the datasets are presented in Table 2, where #train and #test denote the numbers of training and testing samples, respectively. In the experiments, each dataset is randomly split into training and testing samples. For each training dataset, we randomly choose 1/5 of the samples and add large noise to their targets to simulate outliers. As in the experiments on artificial datasets, no noise is added to the targets of the testing datasets. All regression methods are repeated ten times with different partitions of the training and testing data.
Table 2 displays the testing results of LS-SVR and the proposed LAD-SVR. We observe that three criteria (RMSE, MAE, and SSE/SST) of LAD-SVR are clearly better than those of LS-SVR on all datasets, which shows that the robust algorithm achieves better generalization performance and good stability. For instance, LAD-SVR obtains smaller RMSE, MAE, and SSE/SST on the Bodyfat dataset while keeping a larger SSR/SST than LS-SVR; the proposed algorithm derives similar results on the MCPU, AutoMPG, and BH datasets.
To obtain the final regressor of LAD-SVR, the resultant model is solved in the primal space by the classical Newton algorithm iteratively. The number of iterations (Iter) and the running time (Time), including the training and testing time, are listed in Table 2; Iter shows the average number of iterations over ten independent runs. Compared with LS-SVR, LAD-SVR requires more running time. The main reason is that the running time of LAD-SVR is affected by the selection of the starting point $\beta^0$, the value of $|S_1|$, and the number of iterations. In the experiments, the starting point $\beta^0$ is derived by LS-SVR on a small number of training samples. It can be observed that the average number of iterations does not exceed 10, which implies that LAD-SVR is suitable for medium- and large-scale problems. We notice that LAD-SVR does not increase the running time severely: even in the worst case, on the Pyrim dataset, it is no more than 3 times slower than LS-SVR. These experimental results indicate that the proposed LAD-SVR is effective in dealing with robust regression problems.

Conclusion
In this paper, we propose LAD-SVR, a novel robust least squares support vector regression algorithm for datasets with outliers. Compared with classical LS-SVR, which is based on the squared loss function, LAD-SVR employs an absolute deviation loss function to reduce the influence of outliers.
To solve the resultant model, we smooth the proposed loss function with a Huber-type loss function and develop a Newton algorithm. Experimental results on both artificial datasets and benchmark datasets confirm that LAD-SVR has better robustness than LS-SVR. However, like LS-SVR, LAD-SVR lacks sparseness. In the future, we plan to develop more efficient variants of LAD-SVR to improve both sparseness and robustness.

Table 2: Experimental results on benchmark datasets.