An Efficient Feature Weighting Method for Support Vector Regression

Support vector regression (SVR) is a powerful kernel-based method that has been successfully applied to regression problems. In feature-weighted SVR algorithms, the contribution of each feature to the model output is taken into account. However, the performance of such models depends on the feature weights and on the time consumed in training. In this paper, an efficient feature-weighted SVR is proposed. Firstly, a value constraint for each weight is obtained from the maximal information coefficient, which reveals the relationship between each input feature and the output. Then, a constrained particle swarm optimization (PSO) algorithm is employed to optimize the feature weights and the hyperparameters simultaneously. Finally, the optimal weights are used to modify the kernel function. Simulation experiments were conducted on four synthetic datasets and seven real datasets using the proposed model, classical SVR, and several state-of-the-art feature-weighted SVR models. The results show that the proposed method achieves superior generalization ability within acceptable time.


Introduction
Support vector regression (SVR) is widely applied to regression problems due to its ability to convert the original low-dimensional problem into a linear problem in a high-dimensional kernel space by introducing kernel functions [1,2]. The empirical risk and the confidence interval are balanced by adopting the principle of structural risk minimization [3,4]. SVR has shown state-of-the-art performance in system modeling, such as remaining-life prediction for lithium-ion batteries, electricity load forecasting, industrial analysis, coal flotation response prediction, and so on [5-9].
Feature weighting is an effective way to address the problem that all features are treated equally. It makes the importance of a feature better match its influence on the kernel space. Our previous work [10] analyzed the necessity of feature weighting and verified it by using the grid search (GS) method to select the optimal combination of weights. Grid search is an exhaustive search method, so its computational cost is very high. Therefore, a more efficient identification of the weights is needed. Grey correlation degree (GCD) theory has been employed to calculate the weights of input features [11,12]. The main idea of GCD is to evaluate the relationship between an input feature and the output of the training samples according to the similarity of the geometric shapes of their sequence curves. The higher the similarity between the curves, the more important the corresponding input feature. However, the importance of a feature whose geometric shape is not similar to that of the output may be underestimated. Hou et al. developed a weighted kernel function for SVR based on the maximal information coefficient (MIC), a distinct correlation statistic [13]. Both linear and nonlinear relationships between input features and the output were evaluated. MIC can accurately identify the most important features of the input dataset. However, it depends heavily on the quality and size of the training data. Furthermore, it does not guarantee that the obtained weight combination is optimal, especially when the MIC itself is regarded as an optimal weight. Wen et al. proposed a novel attribute feature weighting method called the variable-weighted support vector machine [14]. In this method, the feature weights and the optimal model parameters are both tuned by particle swarm optimization (PSO). Similar methods have also appeared in the literature [15,16]. A more flexible and rational model can be constructed based on the PSO algorithm.
Nevertheless, its further application to feature-weighted SVR is hindered by the drawbacks of slow convergence and convergence to local optima. This paper proposes an efficient feature weighting method for support vector regression. Firstly, the MIC of each feature is calculated to reveal its relationship with the output. The larger the value of the MIC, the greater the feature's contribution. A value constraint for each weight is obtained from the MIC. Secondly, an improved constrained PSO is employed to tune the optimal feature weights and model parameters under these constraints. Finally, the optimal weights are used to modify the kernel function.
The contributions of this work are twofold. Firstly, with limited training samples, the contribution of a feature is hard to calculate accurately, and there is no evidence that this contribution can be regarded as an optimal weight. In this paper, a value constraint on each weight is obtained from the MIC, so both problems are avoided. Secondly, both the weights and the parameters are tuned by a constrained PSO, which converges to the global optimum more efficiently and achieves superior generalization ability with acceptable time consumption. The paper is organized as follows: in Section 2, the SVR modeling method is briefly described. In Section 3, the value constraint of each weight is first obtained from the MIC, and then the constrained PSO algorithm is illustrated in detail. Simulation examples are given in Section 4. In Section 5, general conclusions are drawn.

Basic Review of SVR
The training dataset is given as $T = \{(x_1, y_1), \ldots, (x_l, y_l)\} \in (\mathbb{R}^m \times Y)^l$, where $x_i \in \mathbb{R}^m$ is an input sample with $m$ features and $y_i \in Y = \mathbb{R}$ is the output, $i = 1, \ldots, l$. The SVR model function can be regarded as a hyperplane in the kernel space:

$$f(x) = w^{T}\phi(x) + b, \qquad (1)$$

where $\phi(x)$ maps the low-dimensional input features to a high-dimensional kernel space, $w \in \mathbb{R}^n$ is the weight vector of the hyperplane, and $b$ is a bias term. An insensitive loss width $\varepsilon > 0$ is introduced to avoid overfitting, and additional nonnegative slack variables $\xi_i, \xi_i^*$ are adopted to relax the constraints on certain sample points. SVR modeling is formulated as the convex quadratic programming problem

$$\min_{w, b, \xi, \xi^*} \; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{l}(\xi_i + \xi_i^*), \qquad (2)$$

subject to

$$y_i - w^{T}\phi(x_i) - b \le \varepsilon + \xi_i, \quad w^{T}\phi(x_i) + b - y_i \le \varepsilon + \xi_i^*, \quad \xi_i, \xi_i^* \ge 0, \qquad (3)$$

where $C > 0$ is a penalty parameter. The above convex quadratic programming problem can be solved by constructing the Lagrange function

$$L = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{l}(\xi_i + \xi_i^*) - \sum_{i=1}^{l}\alpha_i\big(\varepsilon + \xi_i - y_i + w^{T}\phi(x_i) + b\big) - \sum_{i=1}^{l}\alpha_i^*\big(\varepsilon + \xi_i^* + y_i - w^{T}\phi(x_i) - b\big) - \sum_{i=1}^{l}(\eta_i\xi_i + \eta_i^*\xi_i^*), \qquad (4)$$

where $\alpha_i, \alpha_i^* \ge 0$ and $\eta_i, \eta_i^* \ge 0$ are Lagrange multipliers. A kernel function $K(\cdot, \cdot)$ satisfying the Mercer condition is introduced to replace the inner product of the high-dimensional kernel space in equation (4). The most commonly used kernel function is the Gaussian kernel [17]:

$$K(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{c}\right), \qquad (5)$$

where $c$ is the kernel parameter that controls the width of the Gaussian kernel. The optimization problem can then be expressed in its dual form:

$$\max_{\alpha, \alpha^*} \; -\frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)K(x_i, x_j) - \varepsilon\sum_{i=1}^{l}(\alpha_i + \alpha_i^*) + \sum_{i=1}^{l}y_i(\alpha_i - \alpha_i^*), \qquad (6)$$

subject to $\sum_{i=1}^{l}(\alpha_i - \alpha_i^*) = 0$ and $0 \le \alpha_i, \alpha_i^* \le C$. The solution is obtained as $\alpha = (\alpha_1, \alpha_1^*, \ldots, \alpha_l, \alpha_l^*)^{T}$, and the model function $f(x)$ can be represented as

$$f(x) = \sum_{i=1}^{l}(\alpha_i - \alpha_i^*)K(x_i, x) + b. \qquad (7)$$
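As a minimal illustrative sketch (not the paper's MATLAB/LIBSVM code), an ε-SVR model of this form can be fitted with scikit-learn's SVR class, which also wraps LIBSVM. The synthetic data and hyperparameter values below are assumptions for illustration; note that sklearn's `gamma` parameterizes the Gaussian width as exp(−gamma·‖xi − xj‖²) rather than dividing by c:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))        # 200 samples, m = 2 features
y = np.sin(3 * X[:, 0]) + 0.5 * X[:, 1]      # smooth illustrative target

# C: penalty parameter, epsilon: insensitive-loss width,
# gamma: Gaussian kernel width (sklearn's parameterization)
model = SVR(kernel="rbf", C=10.0, gamma=1.0, epsilon=0.01)
model.fit(X, y)
pred = model.predict(X)
rmse = float(np.sqrt(np.mean((y - pred) ** 2)))  # fitness criterion used later
```

The training RMSE gives a quick sanity check that the kernel and penalty settings are reasonable before any hyperparameter search.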

Feature-Weighted Support Vector Regression
Maximal Information Coefficient.
For the $j$-th input feature, let $D_j = \{(x_{ij}, y_i)\}_{i=1}^{l}$ denote the set of feature-output pairs. A $p$-by-$q$ grid is drawn on the scatterplot of $D_j$ that partitions the data to encapsulate the relationship. Both $p$ and $q$ are positive integers.
For each pair $(p, q)$, the largest mutual information achieved by any $p$-by-$q$ grid applied to $D_j$ is found as

$$I^{*}(D_j, p, q) = \max_{G} I(D_j|_G), \qquad (8)$$

where $I(D_j|_G)$ is the mutual information of the distribution induced on the cells of grid $G$. Then, the mutual information value is normalized to ensure a fair comparison between grids of different dimensions:

$$M(D_j)_{p,q} = \frac{I^{*}(D_j, p, q)}{\log \min\{p, q\}}, \qquad (9)$$

where $M(D_j)$ is defined as the characteristic matrix and $M(D_j)_{p,q}$ is the highest normalized mutual information achieved by any $p$-by-$q$ grid.
Finally, the maximum value in $M(D_j)$ is defined as the MIC:

$$\mathrm{MIC}(D_j) = \max_{pq < B(l)} M(D_j)_{p,q}, \qquad (10)$$

where $B(l)$ is a function of the sample size, usually set to $B(l) = l^{0.6}$. When the $j$-th input feature and the output are statistically independent, $\mathrm{MIC}(D_j)$ tends to 0. When they are entirely correlated, $\mathrm{MIC}(D_j)$ tends to 1.
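To make the statistic concrete, the following is a simplified sketch that uses equal-width grids rather than the optimized partitions of the original MIC algorithm (Reshef et al.), so it only approximates the true MIC; the function names are ours:

```python
import numpy as np

def grid_mi(x, y, p, q):
    """Mutual information (nats) of an equal-width p-by-q grid on the scatterplot."""
    pxy, _, _ = np.histogram2d(x, y, bins=(p, q))
    pxy /= pxy.sum()                               # joint cell probabilities
    px = pxy.sum(axis=1, keepdims=True)            # marginal over rows
    py = pxy.sum(axis=0, keepdims=True)            # marginal over columns
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))

def mic_approx(x, y):
    """Approximate MIC: maximize normalized grid MI over grids with p*q < l**0.6."""
    l = len(x)
    B = l ** 0.6
    best = 0.0
    for p in range(2, int(B) + 1):
        for q in range(2, int(B) + 1):
            if p * q >= B:
                continue
            best = max(best, grid_mi(x, y, p, q) / np.log(min(p, q)))
    return best
```

An entirely correlated pair scores near 1, while an independent pair scores near 0, matching the behavior described above.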

Implementation of the Constrained PSO.
PSO is a widely used population-based optimization tool in which each member is called a particle. It simulates the foraging behavior of a flock of birds. Each particle is a potential solution to the optimization problem. Particles fly through the problem space according to simple mathematical formulae derived from their positions and velocities.
Each particle's movement is influenced by its personal best-known position (pBest) and is also guided toward the global best position (gBest) found by the other particles. Assume the solution space is $N$-dimensional; the position, the velocity, and the pBest of the $i$-th particle are represented as $x_i = (x_{i1}, \ldots, x_{iN})$, $v_i = (v_{i1}, \ldots, v_{iN})$, and $p_i = (p_{i1}, \ldots, p_{iN})$, respectively. The gBest is represented as $g = (g_1, g_2, \ldots, g_N)$. At each iteration, the $i$-th particle updates its velocity and position according to

$$v_i(t+1) = w'v_i(t) + c_1 r_1 \big(p_i - x_i(t)\big) + c_2 r_2 \big(g - x_i(t)\big), \qquad (11)$$

$$x_i(t+1) = x_i(t) + \mu v_i(t+1), \qquad (12)$$

where $w'$ is a weighting factor that balances global and local search, $c_1$ and $c_2$ are learning factors, $r_1$ and $r_2$ are random values between 0 and 1, and $\mu$ is a parameter related to the flying time. In this paper, PSO is employed to search for the optimal hyperparameters $(C, c, \varepsilon)$ and the optimal combination of feature weights $(w_1, w_2, \ldots, w_m)$ by minimizing a fitness function. The values of the MIC are used to constrain the weights. As the particles move, both the optimal hyperparameters and the optimal weights are obtained. The root mean square error (RMSE) is usually employed to evaluate the feasibility of an SVR method and is selected as the fitness function:

$$\mathrm{RMSE} = \sqrt{\frac{1}{l}\sum_{i=1}^{l}\big(y_i - f(x_i)\big)^2}, \qquad (13)$$

where $y_i$ is the actual output sample and $f(x_i)$ is its corresponding predicted value. The smaller the value of the RMSE, the better the generalization ability.
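The velocity and position updates, together with per-dimension box constraints enforced by clipping (a simple stand-in for the paper's MIC-based weight constraints), can be sketched as follows; the fitness function and bounds are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def pso_minimize(fitness, lo, hi, n_particles=20, iters=200, seed=0,
                 w_prime=0.6, c1=1.5, c2=1.7, mu=1.0):
    """Constrained PSO: minimize fitness over the box [lo, hi]."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    x = rng.uniform(lo, hi, size=(n_particles, len(lo)))   # positions
    v = np.zeros_like(x)                                   # velocities
    pbest = x.copy()
    pbest_fit = np.array([fitness(p) for p in x])
    g = pbest[pbest_fit.argmin()].copy()                   # global best
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w_prime * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + mu * v, lo, hi)                    # enforce box constraints
        fit = np.array([fitness(p) for p in x])
        improved = fit < pbest_fit
        pbest[improved], pbest_fit[improved] = x[improved], fit[improved]
        g = pbest[pbest_fit.argmin()].copy()
    return g, float(pbest_fit.min())

# e.g. minimize a sphere function with optimum (0.5, 0.5) inside [0.01, 1]^2
g, best = pso_minimize(lambda p: np.sum((p - 0.5) ** 2), [0.01, 0.01], [1.0, 1.0])
```

In the paper's setting the particle would carry the hyperparameters and weights, and the fitness would be the cross-validated RMSE of the resulting SVR model.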

Steps of the Feature-Weighted Support Vector Regression.
The steps of the proposed feature-weighted support vector regression (MP-FWSVR) modeling method based on the MIC and the PSO are as follows:

Mathematical Problems in Engineering

Step 1. The MIC between each feature and the output is calculated.
Step 2. The values of the MIC are used to constrain each weight $w_j$ to an interval determined by $\mathrm{MIC}(D_j)$ and the scale factors $\kappa_1$ and $\kappa_2$, where $\kappa_1$ and $\kappa_2$ are set to 0.01 and 1, respectively.
Step 3. The $i$-th particle is regarded as a point in the $(3 + m)$-dimensional space and is represented as $x_i = (C, c, \varepsilon, w_1, \ldots, w_m)$.

Step 4. Obtain the optimal hyperparameters $(C, c, \varepsilon)$ and the optimal combination of feature weights $(w_1, w_2, \ldots, w_m)$ by using the constrained PSO.
Step 5. For the $k$-th feature, if a weight value $w_k$ is given, the kernel element $x_{ik}$ is changed to $w_k x_{ik}$. With $\Lambda = \mathrm{diag}(w_1, \ldots, w_m)$, the weighted Gaussian kernel becomes

$$K(\Lambda x_i, \Lambda x_j) = \exp\!\left(-\frac{\|\Lambda x_i - \Lambda x_j\|^2}{c}\right), \qquad (14)$$

and the model function $f(x)$ can be further developed as

$$f(x) = \sum_{i=1}^{l}(\alpha_i - \alpha_i^*)K(\Lambda x_i, \Lambda x) + b. \qquad (15)$$
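A sketch of the feature-weighted Gaussian kernel of Step 5, assuming the weighting simply rescales each feature before the kernel is evaluated, so a near-zero weight makes its feature contribute almost nothing to the distance:

```python
import numpy as np

def weighted_rbf(xi, xj, w, c=1.0):
    """Gaussian kernel with each feature k scaled by its weight w_k."""
    d = np.asarray(w) * (np.asarray(xi, float) - np.asarray(xj, float))
    return float(np.exp(-np.dot(d, d) / c))
```

With all weights equal to 1, this reduces to the standard Gaussian kernel; with a weight of 0, the corresponding feature is ignored entirely.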

Simulation Examples
In this section, four synthetic datasets and seven real datasets are employed to evaluate the feasibility of MP-FWSVR. All the code is run in MATLAB R2013a on a Windows 10 PC with an Intel Core i5-3470 CPU (3.2 GHz) and 12.0 GB RAM. The SVR algorithm is implemented via LIBSVM 3.22 [18].

Experiments on Artificial Synthetic Datasets.
The definitions of the four functions are listed in Table 1. σ is added Gaussian noise with zero mean and a standard deviation of 0.01. The synthetic dataset "F1" is chosen as an example. The training data of Feature 1 (x_{i1}) and Feature 2 (x_{i2}) are taken from sinusoidal functions of 0.01 Hz and 0.05 Hz, respectively.
Their test datasets are extracted from a linear function and a sinusoidal signal of 0.125 Hz, respectively. The training and test sets are both of size 101 × 2. The two datasets are shown in Figure 1.
Firstly, we compare the optimal weights obtained by the proposed MP algorithm with those of the GS, MIC, and PSO algorithms. For MP and PSO, the parameters are w' = 0.6, c_1 = 1.5, c_2 = 1.7, and μ = 1. The number of particles is set to 20, and the iteration count is 200. The searching range of the hyperparameter ε is [10^{-3}, 1]. For GS and MIC, the optimal weights (w_1, w_2) of Feature 1 and Feature 2 are (0.01, 0.0001) and (0.9496, 0.2461), respectively. MP and PSO are each repeatedly implemented 20 times, and the optimal weights are shown as box charts in Figure 2.
As can be seen from Figure 2, the variation in w_1 obtained by the PSO method is larger than that of the MP method. Moreover, the values of w_2 optimized by PSO are all 0, which causes the feature to be ignored during model training. According to [10], the weight combination (0.01, 0.0001) searched by GS is effective for model training. The mean weight values (0.0143, 0.0001) and (0.0219, 0.0000) obtained by MP and PSO, respectively, are closer to those of the GS method than to those of the MIC method. Then, the MP algorithm is compared with the MIC and PSO algorithms to observe the differences in generalization ability. The weights of MP and PSO nearest to their averages in Figure 2 are selected. The results are shown in Table 2.
As shown in Table 2, the MP algorithm achieves the best generalization performance among the three feature weighting methods. MIC can reveal a feature's relationship with the output, but it does not consider the amplitude of the features. The amplitude ratio of MIC is only 1.2862, far less than the 116.2602 of the MP algorithm, which means that the influence of Feature 2, whose contribution to the output is low, is not noticeably weakened. PSO can obtain both weights and parameters; however, the weights are not constrained to a reasonable interval. In Table 2, w_2 is set to 0, which is obviously unreasonable.
To facilitate comparison, the predicted outputs for the test set are shown in Figure 3. Figure 3 indicates that the output curve of MP-FWSVR almost coincides with the true function, whereas the prediction curve of MIC-FWSVR is quite different from the real output. PSO-FWSVR achieves better results than MIC-FWSVR, but underfitting occurs because Feature 2 is not involved in model training due to w_2 = 0. Finally, we model the four synthetic datasets in Table 3 to compare the proposed MP-FWSVR with GS-FWSVR [10], classical SVR, MIC-FWSVR [13], and PSO-VWSVR [14]. For classical SVR, min-max normalization is employed to linearly map the raw data to [0, 1]. The $k$-th feature of the $i$-th sample, $x_{ik}$, is normalized to

$$x'_{ik} = \frac{x_{ik} - x_k^{\min}}{x_k^{\max} - x_k^{\min}}, \qquad (16)$$

where $x_k^{\max}$ and $x_k^{\min}$ are the maximum and minimum values of the $k$-th feature, respectively. Each program is repeatedly implemented 10 times. The mean values and standard deviations of the RMSE are shown in Table 3. According to Table 3, MP-FWSVR achieves competitive generalization performance. Compared with classical SVR, MIC-FWSVR, and PSO-FWSVR, MP-FWSVR achieves the best generalization performance on almost all four datasets. Compared with GS-FWSVR, only one optimal result is obtained by MP-FWSVR; however, the other three results are close to the GS-FWSVR ones while consuming much less training time.
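The column-wise min-max normalization described above for the classical SVR baseline can be sketched as follows (a minimal version that, like the formula, assumes each feature is non-constant):

```python
import numpy as np

def minmax(X):
    """Normalize each feature (column) of X linearly to [0, 1]."""
    X = np.asarray(X, float)
    xmin, xmax = X.min(axis=0), X.max(axis=0)
    return (X - xmin) / (xmax - xmin)
```

In practice the training-set minima and maxima would be stored and reused to transform the test set, so the two sets share one scale.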

Experiments on Benchmark Datasets.
We randomly chose seven UCI benchmark datasets [19]; the results are compared in Table 4.
According to Table 4, two optimal results and three suboptimal results are obtained by MP-FWSVR. Compared with MIC-FWSVR, MP-FWSVR achieves better results on 6 of the 7 datasets. Compared with PSO-FWSVR, MP-FWSVR achieves better results on 5 of the 7 datasets.
To summarize, Wilcoxon signed-rank tests [20] at the 0.05 significance level are implemented to test the differences among the methods in Tables 3 and 4. The test results are presented in Table 5. The overall results of the Wilcoxon tests in Table 5 show that there are significant differences among MP-FWSVR, classical SVR, and MIC-FWSVR. MP-FWSVR achieves generalization ability closer to that of GS-FWSVR than the PSO-FWSVR method does, with acceptable time consumption.
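A Wilcoxon signed-rank test of paired RMSE results at the 0.05 level can be run with `scipy.stats.wilcoxon`; the RMSE numbers below are hypothetical illustrations, not the paper's values:

```python
from scipy.stats import wilcoxon

# Paired RMSE of two methods on the same seven datasets (hypothetical numbers)
rmse_a = [0.031, 0.052, 0.040, 0.075, 0.061, 0.048, 0.055]  # method A
rmse_b = [0.045, 0.068, 0.058, 0.090, 0.070, 0.050, 0.072]  # method B

# Tests whether the paired differences are symmetric about zero
stat, p = wilcoxon(rmse_a, rmse_b)
significant = p < 0.05
```

Rejecting the null at p < 0.05 indicates a significant performance difference between the two paired methods, which is how Table 5 is read.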

Conclusions
In this paper, we propose a feature-weighted SVR method named MP-FWSVR, which integrates the advantages of the MIC and PSO algorithms to solve the time-consumption problem of the GS-FWSVR algorithm proposed in our previous work. The contribution of each input feature is expressed by its MIC, which is used to constrain the value of the feature weight during the PSO optimization of the weight combination. Numerical experiments show the effectiveness of the proposed algorithm. Our future work will continue to focus on feature weighting algorithms to obtain better generalization ability.

Data Availability
In this paper, four synthetic datasets and seven real datasets are employed to evaluate the feasibility of MP-FWSVR. The seven UCI benchmark datasets were randomly chosen from http://archive.ics.uci.edu/ml/index.php.

Conflicts of Interest
The authors declare that there are no conflicts of interest.