Probability Distribution and Deviation Information Fusion Driven Support Vector Regression Model and Its Application

In modeling, only information from the deviation between the output of the support vector regression (SVR) model and the training sample is considered, whereas other prior information of the training sample, such as probability distribution information, is ignored. Probability distribution information describes the overall distribution of the sample data in a training sample that contains different degrees of noise and potential outliers, and it helps in developing a high-accuracy model. To mine and use the probability distribution information of a training sample, a new support vector regression model that incorporates probability distribution information as a weight (PDISVR) is proposed. In the PDISVR model, the probability distribution of each sample is considered as the weight and is then introduced into the error coefficient and slack variables of SVR. Thus, the deviation and probability distribution information of the training sample are both used in the PDISVR model to eliminate the influence of noise and outliers in the training sample and to improve predictive performance. Furthermore, examples with different degrees of noise were employed to demonstrate the performance of PDISVR, which was then compared with that of three SVR-based methods. The results showed that PDISVR performs better than the three other methods.


Introduction
Since its proposal by Vapnik, the support vector machine (SVM) has been used in many areas, including both pattern recognition and regression estimation [1,2]. The original SVM provides a pair of parameters as the solution to a quadratic programming problem. SVM has some advantages, such as low standard deviation and easy generation, as well as some disadvantages, such as the redundancy of the regression function and the low efficiency of support vector selection. To address these disadvantages, various improvements to the support vector algorithm and its kernel function have been proposed. Suykens proposed least-square support vector regression (LS-SVR) for regression modeling problems [3,4]. By transferring inequality constraints to equality constraints, LS-SVR simplifies the solution of quadratic programming problems [5]. In the field of regression, Smola proposed the linear programming support vector regression (LP-SVR) model [6,7]. LP-SVR has numerous strengths, such as the use of more general kernel functions and a fast learning ability. LP-SVR can control the accuracy and sparseness of the original SVR by using the linear kernel combination as a solution approach. In addition, a new kernel function, the multikernel function (MK), has been introduced into the standard SVM model. MK provides lower error and requires a shorter training period than the original kernel function. Multiple-kernel SVR (MKSVR) is very popular in some systems. Yeh et al. [8] developed MKSVR for stock market forecasts. Lin and Jhuo [9] discovered a method to generate MKSVR parameters for integration into a system that converts the pixels of a checkpoint into the brightness value. Zhong and Carr [10] used the MKSVR model to estimate pure and impure carbon dioxide-oil minimum miscibility pressures (MMP) in a CO2 enhanced oil recovery process.
The SVR model has also been improved by using prior knowledge [11,12]. There are numerous types of prior knowledge, including the average value and monotonicity of the sample data. To use prior knowledge appropriately, three types of methods are utilized in SVR [13]. Our team previously worked on monotonicity prior knowledge of sample data. This monotonicity prior knowledge is described by first-order difference inequality constraints of the kernel expansion and by additive kernels [14]. The constraints are added directly to the kernel formulation to obtain a convex optimization problem. For additive kernels, SVMs are constructed through the addition of separate kernels for every input dimension. These operations give the SVR model higher accuracy in support vector (SV) selection.
Inevitably, even small noise can degrade the accuracy of the model. Furthermore, in some situations, part of the noisy information may be ten to even dozens of times larger than the normal data. These outliers introduce bias and inaccuracies into SVR. Nevertheless, the probability distribution of the sample data is a good indicator of noise. From the perspective of the probability distribution of sample data, normal data and data that contain the least amount of noise have the highest probability in the sample data. By contrast, data that contain a large amount of noise have a relatively small probability. Thus, outliers in the sample data will have the smallest probability. Therefore, the probability distribution is prior knowledge that helps weaken the influence of noise and outliers in the sample data. We use this information to modify our SVR model.
This article is structured as follows: Section 2 introduces standard SVR algorithms. Section 3 describes the proposed algorithm that integrates probability distribution information into the SVR framework. Section 4 provides experimental results obtained by comparing the proposed algorithm with other algorithms. Finally, Section 5 presents conclusions about the proposed algorithm.

Review of SVR
To better describe the proposed algorithm, the basic concepts of SVR and the usage of deviation information are first stated mathematically.

Support Vector Regression (SVR)
SVR was originally used to solve linear regression problems. For given training samples $X = \{(\mathbf{x}_i, y_i) \mid \mathbf{x}_i \in \mathbb{R}^{D},\ y_i \in \mathbb{R},\ i = 1, 2, \ldots, n\}$, fitting aims to find the dependency between the independent variable $\mathbf{x}_i$ and the dependent variable $y_i$. Specifically, it aims to identify an optimal function that minimizes the expected risk $R(\omega) = \int L(y, f(\mathbf{x}, \omega))\, dP(\mathbf{x}, y)$, where $\{f(\mathbf{x}, \omega)\}$ is the predictive function set, $\omega \in \Omega$ is the generalized parameter of the function, $L(y, f(\mathbf{x}, \omega))$ is the loss function, and $f$ is the fitting function [15]. Thus, the optimal linear function for SVR is obtained by solving a constrained optimization problem in which the penalty coefficient $C$, which determines the accuracy of the function fitting and the penalty on errors greater than $\varepsilon$, is given in advance. The parameter $\varepsilon$ controls the size of the fitting error, the number of support vectors, and the generalization capability. To account for the accuracy of the fitting error, the slack variables $\xi_i$ and $\xi_i^{*}$ are introduced. Figure 1 in reference [10] illustrates this linear fitting problem.
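For reference, the textbook form of the linear $\varepsilon$-insensitive SVR problem that matches this description can be sketched as follows. It assumes a linear fitting function $f(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle + b$ and $n$ training samples; the symbols $\mathbf{w}$ and $b$ are introduced here for illustration and may differ from the authors' notation.

```latex
\begin{aligned}
\min_{\mathbf{w},\, b,\, \xi,\, \xi^{*}} \quad & \frac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_{i=1}^{n} (\xi_i + \xi_i^{*}) \\
\text{s.t.} \quad & y_i - \langle \mathbf{w}, \mathbf{x}_i \rangle - b \le \varepsilon + \xi_i, \\
& \langle \mathbf{w}, \mathbf{x}_i \rangle + b - y_i \le \varepsilon + \xi_i^{*}, \\
& \xi_i,\ \xi_i^{*} \ge 0, \quad i = 1, 2, \ldots, n.
\end{aligned}
```

Residuals inside the band of half-width $\varepsilon$ cost nothing, and only the excess beyond the band is charged through the slack variables at rate $C$.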
However, the previous solution applies only to linear regression problems. Nonlinear regression requires a kernel function in the SVR model [16]. The kernel function can be expressed as $K(\mathbf{x}_i, \mathbf{x}_j) = \langle \varphi(\mathbf{x}_i), \varphi(\mathbf{x}_j) \rangle$, where $\varphi$ is the mapping from a low-dimensional space to a high-dimensional space. The independent variable $\mathbf{x}_i$ is mapped to a feature space so that a nonlinear problem can be changed into a linear problem. After introducing the kernel function, the new fitting function becomes $f(\mathbf{x}) = K(\mathbf{x}, X^{T})\beta + b$, where the symbol $X^{T}$ indicates the transpose of the matrix $X$.
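As an illustration of the kernel expansion, the following minimal Python sketch implements the three kernel families used later in the experiments (linear, polynomial, Gaussian) and evaluates a fitting function of the form "weighted sum of kernel values plus bias". The parameter names `degree`, `coef0`, and `sigma`, and the exact kernel parametrizations, are assumptions chosen for illustration; the paper does not state them.

```python
import math


def linear_kernel(u, v):
    """Inner product of two input vectors."""
    return sum(a * b for a, b in zip(u, v))


def polynomial_kernel(u, v, degree=2, coef0=1.0):
    """Polynomial kernel (<u, v> + coef0) ** degree (assumed parametrization)."""
    return (linear_kernel(u, v) + coef0) ** degree


def gaussian_kernel(u, v, sigma=1.0):
    """Gaussian (RBF) kernel exp(-||u - v||^2 / (2 sigma^2)) (assumed parametrization)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def predict(x, X_train, beta, b, kernel):
    """Kernel expansion: sum_j beta_j * K(x, x_j) + b."""
    return sum(beta_j * kernel(x, x_j) for beta_j, x_j in zip(beta, X_train)) + b
```

Here `beta` holds one coefficient per training sample, which is why its length equals the number of samples.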
The change in the fitting function leads to a new constrained optimization problem over $\beta$ and $b$, in which the length of $\beta$ and $\mathbf{x}$ is $n$. The notation $K(\cdot, \cdot)$ denotes a kernel function that fulfills Mercer's conditions. The standard SVR is a compromise between structural risk minimization and empirical risk minimization. In particular, for the support vector regression learning algorithm, the structural risk term is $\frac{1}{2}\langle \beta \cdot \beta \rangle$ and the empirical risk term is $\sum_{i=1}^{n} (\xi_i + \xi_i^{*})$. However, calculating the structural risk term $\frac{1}{2}\langle \beta \cdot \beta \rangle$ requires considerable time and resources [17]. Researchers found that minimizing the 1-norm of the parameter $\beta$ instead reduces the time and resources spent on calculation, so the structural risk term in the optimization problem is replaced by $\lVert \beta \rVert_{1}$. Although the time and resources spent on modeling are reduced, there is no considerable difference in the final accuracy.
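The two kernel-based objectives discussed above can be sketched side by side. The constraints are written in the textbook $\varepsilon$-insensitive form with the fitting function $K(\mathbf{x}_i, X^{T})\beta + b$ and $n$ training samples; the layout is an assumption for illustration rather than a reproduction of the authors' numbered equations.

```latex
\begin{aligned}
\text{2-norm form:} \quad & \min_{\beta,\, b,\, \xi,\, \xi^{*}} \ \frac{1}{2}\langle \beta \cdot \beta \rangle + C \sum_{i=1}^{n} (\xi_i + \xi_i^{*}) \\
\text{1-norm form:} \quad & \min_{\beta,\, b,\, \xi,\, \xi^{*}} \ \lVert \beta \rVert_{1} + C \sum_{i=1}^{n} (\xi_i + \xi_i^{*}) \\
\text{s.t. (both forms)} \quad & y_i - K(\mathbf{x}_i, X^{T})\beta - b \le \varepsilon + \xi_i, \\
& K(\mathbf{x}_i, X^{T})\beta + b - y_i \le \varepsilon + \xi_i^{*}, \\
& \xi_i,\ \xi_i^{*} \ge 0, \quad i = 1, 2, \ldots, n.
\end{aligned}
```

With the 1-norm objective, both the objective and the constraints are linear in the unknowns once $\beta$ is split into positive and negative parts, so the problem can be solved as a linear program instead of a quadratic program.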

Support Vector Regression with Deviation Information as a Consideration.
Traditional SVR does not possess a special method for addressing noise in sample data. An efficient way to weaken noise is to adjust the parameters of the SVR model. These parameters are called hyperparameters in SVRs.
Hyperparameters exert a considerable impact on algorithm performance. The general way to test the performance of hyperparameters is via the deviation between the model output and the sample data [18]. The obtained deviation is then compared with other deviations to select the minimum deviation as the final result. The parameters that correspond to the minimum deviation are the best parameters in the optimization process. Usually, this process is conducted using an intelligent optimization algorithm, such as particle swarm optimization (PSO) [19] or a genetic algorithm (GA) [20]. The deviation is set as the fitness function of the intelligent optimization algorithm. In this section, we refer to this method as deviation-minimized SVR (DM-SVR).
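In most circumstances, the deviation between the model output and the sample data is represented by the correlation coefficient $r$ or the mean square error (MSE). Given vector $\hat{\mathbf{y}}$ as the model output and vector $\mathbf{y}$ as the sample output, the correlation coefficient $r$ can be expressed as

$$r = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^{2} \sum_{i=1}^{n} (y_i - \bar{y})^{2}}},$$

where $\bar{\hat{y}}$ and $\bar{y}$ are the means of $\hat{\mathbf{y}}$ and $\mathbf{y}$. The formula for the mean square error (MSE) is as follows:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^{2}. \tag{8}$$

In short, a group of parameters for which the MSE is close to zero and $r$ is close to one produces the best performance.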
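The two criteria can be computed directly from the model output and the sample output. The sketch below uses the standard definitions of the Pearson correlation coefficient and the mean square error; the function names are illustrative.

```python
import math


def mean_square_error(y_true, y_pred):
    """Average squared difference between sample output and model output."""
    n = len(y_true)
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n


def correlation_coefficient(y_true, y_pred):
    """Pearson correlation between sample output and model output.

    Raises ZeroDivisionError if either vector is constant, because the
    coefficient is undefined in that case.
    """
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)
```

A model whose output is a positive linear function of the sample output has a correlation of one even if its MSE is large, which is why the two criteria are reported together.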

Probability Distribution Information Weighted Support Vector Regression
Although DM-SVR can reduce the influence of noise, it also has some weaknesses. The main disadvantage of this method is the time spent on training. Many parameters need to be optimized in SVR, and any extra parameters to optimize make the training process inefficient. To resolve the uncertainty of the error parameter $\varepsilon$, we introduce probability distribution information (PDI) into SVR and designate the resulting model as PDISVR.

Probability Distribution of the Output.
The probability distribution information corresponds to the probability density function and describes the likelihood that the output value of a continuous random variable is near a certain point.
Integrating the probability density function is the proper way to calculate the probability that the random variable falls in a certain region. From the sample data, we can count the frequency with which the output takes different values. We denote this frequency as $f(\mathbf{y})$, where $\mathbf{y}$ is the output value vector. Let $p(\mathbf{y})$ be the probability of the sample's output. The relationship between $p(\mathbf{y})$ and $f(\mathbf{y})$ is given by (9), where $Y$ is the range of $\mathbf{y}$. Then, we can easily obtain the probability distribution function. The next step is the identification of the probability of every point.
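One simple way to turn output frequencies into a per-sample probability is a histogram estimate, sketched below. This is an assumption made for illustration: the choice of equal-width bins, the parameter `n_bins`, and normalization by the sample count are not taken from the paper, whose exact estimator is not shown here.

```python
def output_probabilities(y, n_bins=10):
    """Estimate, for each output value, the relative frequency of its histogram bin.

    Returns a list p where p[i] is the fraction of samples that fall in the
    same equal-width bin as y[i].
    """
    lo, hi = min(y), max(y)
    width = (hi - lo) / n_bins
    if width == 0:
        width = 1.0  # all outputs identical: everything lands in bin 0
    counts = [0] * n_bins
    bin_of = []
    for value in y:
        # The maximum value would index one past the end, so clamp it.
        k = min(int((value - lo) / width), n_bins - 1)
        counts[k] += 1
        bin_of.append(k)
    total = len(y)
    return [counts[k] / total for k in bin_of]
```

For example, with outputs `[1.0, 1.1, 1.2, 1.3, 9.0]` and four bins, the four clustered values each receive probability 0.8 and the isolated value receives 0.2, so the isolated point is flagged as the least probable.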

Optimization Formula with Probability Distribution Information Weight.
Once we have obtained the probability distribution of the output, it should be integrated into the basic SVR model. In the basic SVR model, the error parameter $\varepsilon$ indicates the accuracy of model fitting by providing an area that incurs no loss in the objective function. However, due to the influence of noise, some sample data contain excessive noise information. If the same parameter $\varepsilon$ is adopted for all samples, the performance of the model is reduced. To prevent this situation, SVR should be adjusted in accordance with the noise information. We propose describing the noise information through the probability distribution of the output. Samples in regions with low probability have a relatively large proportion of noise. For this reason, in modeling, a region with higher probability should have a smaller error parameter than a region with lower probability. Thus, the probability distribution function increases the accuracy of the SVR model in the area with a high probability of output. Define the $\varepsilon$-insensitive loss function as $L_{\varepsilon}(y, f(\mathbf{x})) = \max(0, |y - f(\mathbf{x})| - \varepsilon)$, where $f(\mathbf{x})$ is a regression estimation function constructed by learning the sample and $y$ is the target output value that corresponds to $\mathbf{x}$. By defining the $\varepsilon$-insensitive loss function, the SVR model specifies the error requirement of the estimated function on the sample data. This requirement is the same for every sample point. To avoid this situation, the artificially set error parameter $\varepsilon$ is divided by the probability distribution vector $p(\mathbf{y})$. Figure 1 illustrates the change from a constant $\varepsilon$ to a vector $\varepsilon$. The distance between the two hyperplanes is modified in areas where the density of the points differs. In the high-density area, the model has a smaller error parameter. By contrast, in the low-density area, the model has a larger error parameter. The density of the output points is directly related to the probability of the sample's output $y$. Therefore, the division by PDI makes the SVR model emphasize the area with a high density of points. This technique can improve overall accuracy despite sacrificing the accuracy of low-density areas.
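The effect of dividing the error parameter by the output probability can be sketched in a few lines of Python. The function names are illustrative, and the handling of a zero probability (treated here as an unbounded tolerance) is an assumption, since the text does not say how that case is resolved.

```python
import math


def per_sample_epsilon(epsilon, probabilities):
    """Divide the global error parameter by each sample's output probability.

    High-probability samples get a tolerance close to epsilon; low-probability
    samples get a wider tolerance. A zero probability maps to an unbounded
    tolerance (assumption), so such a sample never contributes to the loss.
    """
    return [epsilon / p if p > 0 else math.inf for p in probabilities]


def eps_insensitive_loss(y, f_x, eps):
    """Loss is zero inside the tube of half-width eps and linear outside it."""
    return max(0.0, abs(y - f_x) - eps)
```

With `epsilon = 0.1`, a sample of probability 0.8 gets a tolerance of about 0.125 and a sample of probability 0.2 gets 0.5, so the same residual of 0.3 is penalized for the first sample and ignored for the second.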
According to (9) and (10), the PDISVR can be expressed as (11). By comparing (11) with the standard form of SVR, we can see that the error parameter $\varepsilon$ changes in accordance with $p(y_i)$.
The PDISVR model therefore has a low error tolerance in regions with a high density of points.
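A plausible form of the weighted problem is sketched below. It assumes the 1-norm objective from the review of SVR and replaces the constant $\varepsilon$ in each constraint with $\varepsilon / p(y_i)$, as the text describes; the choice of objective and the exact layout are assumptions, not a reproduction of the authors' equation.

```latex
\begin{aligned}
\min_{\beta,\, b,\, \xi,\, \xi^{*}} \quad & \lVert \beta \rVert_{1} + C \sum_{i=1}^{n} (\xi_i + \xi_i^{*}) \\
\text{s.t.} \quad & y_i - K(\mathbf{x}_i, X^{T})\beta - b \le \frac{\varepsilon}{p(y_i)} + \xi_i, \\
& K(\mathbf{x}_i, X^{T})\beta + b - y_i \le \frac{\varepsilon}{p(y_i)} + \xi_i^{*}, \\
& \xi_i,\ \xi_i^{*} \ge 0, \quad i = 1, 2, \ldots, n.
\end{aligned}
```

Because $0 < p(y_i) \le 1$, the per-sample tolerance $\varepsilon / p(y_i)$ is never smaller than $\varepsilon$ and grows as the output becomes less probable.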
To further improve the performance of the SVR model, we consider adding an extra element to the PDISVR framework. The PDISVR model has only a single error parameter $\varepsilon$. However, $\varepsilon$ is too small to have an obvious impact on the accuracy of the model. Hence, we propose an additional way to introduce PDI; that is, we apply to the slack variable $\xi$ the same operation as that applied to the error parameter $\varepsilon$. Given the treated error parameter $\varepsilon$ in the PDISVR model, we divide the slack variables $\xi_i$ and $\xi_i^{*}$ by $p(y_i)$ as well, which gives the model (12).

Experimental Results
To verify the effect of the probability distribution information on the standard SVR model, we employed three kinds of numerical experiments with real datasets. In these experiments, we considered three kernels as the SVR kernel functions: the linear kernel, the polynomial kernel, and the Gaussian kernel. All of the experiments were run in MATLAB on an Intel i5 CPU with 6 GB of memory.
The experiments mainly compare different SVR models, including basic SVR, MKSVR, and heuristic weighted SVR [21]. The correlation coefficient $r$ and the mean square error (MSE) are used to evaluate generalization performance. The formulas of these two criteria are listed in Section 2.2.

Example 1.
In the test function (13), the noise term indicates a random fluctuation between $-k$ and $k$. From the range [2.1, 9.9] of the input variable, we generated 100 data points at random. Through (13), we obtained the outputs of these 100 points. We evenly divided the 100 points into five parts, comprising four training parts and one testing part. Using the cross-validation method introduced in Section 3.3, the optimal hyperparameters for the different SVR algorithms were selected.
After obtaining the optimal hyperparameters, we determined the influence of noise on the different SVR algorithms. The noise magnitude $k$ was set to 0.1, 0.5, 1, 3, 6, and 10 in accordance with the output range. To obtain objective comparisons, 10 groups of noise were added to each training sample of the algorithm using the MATLAB toolbox, which generated 10 training datasets in total. Moreover, the testing data were generated directly from the objective function (13). The results for the criteria of these ten experiments are recorded by their average and standard deviation values, as shown in Figures 3-5.
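The noise protocol can be sketched as follows. The sketch is written in Python rather than MATLAB and assumes that the fluctuation between $-k$ and $k$ is drawn uniformly, which the text does not state; the clean outputs would come from the test function (13), which is not reproduced here, so the caller supplies them.

```python
import random


def add_uniform_noise(y_clean, k, rng):
    """Add an independent fluctuation from [-k, k] to every clean output."""
    return [value + rng.uniform(-k, k) for value in y_clean]


def make_noisy_groups(y_clean, k, n_groups=10, seed=0):
    """Build n_groups noisy copies of the same clean training outputs.

    A fixed seed keeps the experiment repeatable; each group uses fresh draws.
    """
    rng = random.Random(seed)
    return [add_uniform_noise(y_clean, k, rng) for _ in range(n_groups)]
```

Calling `make_noisy_groups(y_clean, k)` once for each `k` in `[0.1, 0.5, 1, 3, 6, 10]` yields ten training sets per noise level, over which the mean and standard deviation of each criterion can then be taken.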
In these three figures, the average criterion values indicate the general performance of each algorithm, and the standard deviations represent its stability. From Figures 3-5, we can see that the performance of the proposed PDISVR is less affected by noise than that of the other three SVR algorithms. In Figure 3, the result line of PDISVR is more stable than those of the other three methods, and PDISVR achieves the best prediction performance among all models when larger noise is added to the samples. In contrast to Figure 3, the predictive ability of PDISVR is not always the best in Figures 4 and 5, which means that PDISVR with the linear kernel is the most suitable for this dataset. In addition, in the region with high noise intensity, the basic SVR and MKSVR handled the effects of noise poorly. Although HW-LSSVR resisted some of the effects of noise, its performance worsened slightly with a high intensity of noisy samples. The average prediction accuracy and the standard deviation of PDISVR were relatively better when fitting models with noise of 1, 3, 6, and 10. With noise of 0.1 and 0.5, although the differences among the average values were small, PDISVR was more stable than the other algorithms in certain circumstances.

Example 2.
The effects of rough error cannot be ignored in real production processes. To better simulate real production conditions and reveal the robustness of the proposed PDISVR when the training samples involve outliers, a rough error term was added to the function in the previous model. A total of 80 data points with a noise intensity of 1 were randomly generated by (13) as the fundamental training sample. Test samples containing 20 data points were also generated by (13). Then, rough error terms with multipliers of 4.5 and 3 were added to the dependent variables of the 17th and 48th data points in the fundamental training sample, respectively, as two trivial outliers. A rough error term with a multiplier of 10 was added to the dependent variable of the 50th data point as one strong outlier. Thus, a new training sample was constructed that contained one strong outlier (the 50th data point) and two trivial outliers (the 17th and 48th data points).
To better compare the predictive performance of the different SVR algorithms, the same four algorithms were trained ten times on samples with three outliers. The average values and standard deviations of $r$ and MSE represent the performance of these algorithms.
As indicated in Tables 2-4, the PDISVR algorithm performed better in the testing experiments than the other algorithms. The unweighted SVR and MKSVR were influenced by noise and produced biased estimates in their predictions, whereas PDISVR dramatically reduced this effect. Given its misjudgments of the outliers in this complicated system, the HWSVR algorithm could not obtain satisfactory results even when it adopted weighted error parameters.
To illustrate the quality of PDISVR's weight element, we compared the weighting results in Table 2. The weight values of the PDISVR algorithm for the training samples are shown in Figure 6. The weight values of the HWSVR algorithm are shown in Figure 7. As shown in Figure 6, the weights of the two trivial outliers were 0 and 0.10531 for the 17th and 48th data points, respectively, and the weight of the strong outlier was 0.00036, which indicates that PDISVR precisely detected the outliers. As shown in Figure 7, HWSVR did not perform as well as PDISVR. The strong outlier had a weight of 0.0751 and the two trivial outliers had weights of 0.3143 and 0.2729, which were unsuitable for modeling given that smaller weights, such as that of the 23rd data point (0.0126), could affect outlier detection. Thus, the influence of the two trivial outliers on the predictions of the PDISVR algorithm was reduced and the influence of the strong outlier was eliminated. By contrast, the effect of the outliers remained in the HWSVR algorithm.

Example 3.
To test our regression model in a more realistic way, we imported six real datasets from the UCI Machine Learning Repository [23-25], the Department of Food Science, University of Copenhagen database [26], and real chemical industrial processes [14]. See Table 5 for more detailed information. In these datasets, four-fifths of the data were used for training and one-fifth for testing. The hyperparameters used in this example were also obtained by the process introduced in Section 3.3.
As shown in Tables 6-8, the proposed PDISVR obtained the best predictive ability on the majority of the criteria. For example, in the case of Auto-MPG, the proposed PDISVR achieved the best results on both criteria. Thus, the proposed PDISVR is appropriate for the Auto-MPG dataset. In the datasets for crude oil distillation and computer hardware, the proposed PDISVR obtained only the best correlation coefficient $r$ and could not establish a suitable model at points where the probability distribution was low, thus increasing the MSE. Therefore, the use of PDISVR requires validation of the noise according to its probability distribution. Thus, the proposed PDISVR can be applied to improve SVR in the case of large datasets.

Conclusion
In traditional SVR modeling, the deviation between the model outputs and the real data is the only way to represent the influence of noise. Other information, such as probability distribution information, is not emphasized. Therefore, we proposed a method that uses the probability distribution information to modify the basic error parameter and slack variables of SVR. Given that these parameters are weighted by the probability distribution of the model output points, they can be adjusted by SVR itself and no longer require optimization by an intelligent optimization algorithm. The proposed algorithm is superior to other SVR-based algorithms in dealing with noisy data and outliers in both simulated and real datasets.

Figure 1: Linear problem illustration of a PDISVR model.

Figure 3: Four models' prediction results with the linear kernel under different noise: (a) is the average value of the correlation coefficient r, (b) is the average value of MSE, (c) is the standard deviation of the correlation coefficient r, and (d) is the standard deviation of MSE.

Figure 4: Four models' prediction results with the radial basis function kernel under different noise: (a) is the average value of the correlation coefficient r, (b) is the average value of MSE, (c) is the standard deviation of the correlation coefficient r, and (d) is the standard deviation of MSE.

Figure 5: Four models' prediction results with the polynomial kernel under different noise: (a) is the average value of the correlation coefficient r, (b) is the average value of MSE, (c) is the standard deviation of the correlation coefficient r, and (d) is the standard deviation of MSE.

Figure 6: Weights of the training sample for PDISVR.

Table 1: Typical parameters for PSO algorithms.

In (12), the constraints take the form

$$y_i - K(\mathbf{x}_i, X^{T})\beta - b \le \frac{1}{p(y_i)}(\varepsilon + \xi_i), \qquad \xi_i,\ \xi_i^{*} \ge 0, \qquad i = 1, 2, \ldots, n. \tag{12}$$

3.3. Parameters Optimization Based on PSO.
Normally, the performance of the different SVRs depends strongly on parameter selection. The PSO-based hyperparameter selection method in [10] is used in this paper. After the dataset is normalized, the control parameters of PSO, including the maximum velocity ($v_{\max}$), minimum velocity ($v_{\min}$), initial inertia weight ($w_{\min}$), final inertia weight ($w_{\max}$), cognitive coefficient ($c_1$), social coefficient ($c_2$), maximum generation, and population size, should be initialized according to the experience of operators. In our experiment, we set the control parameters of PSO based on Table 1.
In the following experiments, there are at most five hyperparameters of the different SVRs that need to be optimized by PSO. These parameters are the penalizing coefficient ($C$), the radial basis function kernel parameter, the polynomial degree ($d$), $\varepsilon$ in the $\varepsilon$-insensitive function, and the mixing coefficient ($m$) in the multikernel function. In our experiments, different comparative methods adopt different groups of parameters and, in order to search for the global optimum reasonably, the parameter $m$ is limited to $[0, 1]$, the radial basis function kernel parameter to $[10^{-2}, 10]$, $d$ to $[1, 10]$, $C$ to $[10, 1000]$, and $\varepsilon$ to $[10^{-4}, 10^{-1}]$. While PSO searches for the best parameters, the particles update their positions by changing velocity and finally converge at a global optimum within the search space. In this study, the particles' positions are the combination of $m$, the radial basis function kernel parameter, $d$, $C$, and $\varepsilon$, which are denoted as P. Then the V-fold (V = 5) cross-validation resampling method is applied to validate the performance of the best parameters found until the criteria are met. To evaluate the performance of the training process, the mean square error (MSE) is chosen as the fitness function, which is formulated as (8). Figure 2 shows the workflow for finding the optimum values of each parameter in the models.
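The search procedure can be illustrated with a minimal, generic PSO sketch in Python. It is not the authors' implementation: the defaults (`n_particles`, `n_iter`, a linearly changing inertia weight from `w_start` to `w_end`, `c1`, `c2`, and a velocity limit of 20% of each range) are illustrative assumptions and are not the values of Table 1. The `fitness` callable is a placeholder for the cross-validated MSE of an SVR trained with the candidate hyperparameters.

```python
import random


def pso_minimize(fitness, bounds, n_particles=20, n_iter=50,
                 w_start=0.9, w_end=0.4, c1=2.0, c2=2.0, seed=0):
    """Minimize fitness(position) inside box bounds with a basic PSO."""
    rng = random.Random(seed)
    dim = len(bounds)
    v_max = [0.2 * (hi - lo) for lo, hi in bounds]  # assumed velocity limit
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [fitness(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    for it in range(n_iter):
        # Inertia weight moves linearly from w_start to w_end.
        w = w_start + (w_end - w_start) * it / max(1, n_iter - 1)
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                v = (w * vel[i][d]
                     + c1 * r1 * (pbest[i][d] - pos[i][d])
                     + c2 * r2 * (gbest[d] - pos[i][d]))
                v = max(-v_max[d], min(v_max[d], v))  # clamp velocity
                vel[i][d] = v
                lo, hi = bounds[d]
                pos[i][d] = max(lo, min(hi, pos[i][d] + v))  # stay in range
            val = fitness(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val
```

For a search over $C$ and $\varepsilon$, the bounds would be `[(10.0, 1000.0), (1e-4, 1e-1)]`, and `fitness` would train the model on four folds and return the MSE on the fifth, averaged over the five splits.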

Table 2: Testing results of SVR algorithms with rough error (linear kernel).

Table 3: Testing results of SVR algorithms with rough error (radial basis function kernel).

Table 4: Testing results of SVR algorithms with rough error (polynomial kernel).

Table 5: Details of the experimental datasets.

Table 6: Comparative results of previous SVR models in real datasets (linear kernel).

Table 7: Comparative results of previous SVR models in real datasets (radial basis function kernel).

Table 8: Comparative results of previous SVR models in real datasets (polynomial kernel).