The Scientific World Journal (TSWJ), ISSN 1537-744X, 2356-6140

Hindawi Publishing Corporation

2014, Article ID 851814, doi:10.1155/2014/851814

Research Article

Density-Based Penalty Parameter Optimization on C-SVM

Yun Liu,¹ Jie Lian,¹ Michael R. Bartolacci,² and Qing-An Zeng³

Additional contributor listed without affiliation: Han-Chieh Chao

¹ Key Laboratory of Communication and Information Systems, Beijing Municipal Commission of Education, Beijing Jiaotong University, Beijing 100044, China (njtu.edu.cn)

² Information Sciences and Technology, Penn State University-Berks, Reading, PA 19610, USA (psu.edu)

³ Department of Electronics, Computer and Information Technology, North Carolina Agricultural and Technical State University, Greensboro, NC 27411, USA (ncat.edu)

Dates: 29 April 2014; 4 June 2014; 7 July 2014

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The support vector machine (SVM) is one of the most widely used approaches for data classification and regression. SVM maximizes the distance between the positive and negative support vectors and disregards instances that lie far from the SVM interface. To keep the interface from shifting because of an erroneous outlier, C-SVM was introduced to reduce the influence of such outliers. Traditional C-SVM uses a uniform parameter C for both positive and negative instances; however, because the two classes can differ in size and in data distribution, positive and negative instances should receive different weights for the penalty parameter of the error terms. Therefore, in this paper, we propose density-based penalty parameter optimization of C-SVM. The experimental results indicate that the proposed algorithm performs well with respect to both precision and recall.

1. Introduction

Data classification algorithms, such as logistic regression (LR) [1–6] and support vector machine (SVM) [7–10], are crucial in many applications. SVM is a locally determined classifier: it pursues the interface with the maximum margin and disregards the distance from remote instances to that interface [11–13]. The discriminant equation of the SVM model can be written as

$$Y = g(X;\omega) = \begin{cases} 1, & \omega^{T}X + b > 0, \\ -1, & \omega^{T}X + b < 0, \end{cases} \tag{1}$$

where $X$ denotes the feature vector of an arbitrary input instance and $x_i$ is a single feature in that vector, with $X = \{x_1, x_2, \ldots, x_m\}$. The model is trained with all positive instances labeled $Y = 1$ and all negative instances labeled $Y = -1$ in order to find appropriate values for the parameters $\omega$ and $b$. An unknown instance $X_i$ is then classified as positive when $\omega^{T}X_i + b > 0$ and as negative when $\omega^{T}X_i + b < 0$.
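As a minimal sketch of the decision rule in (1), the following Python function returns the predicted label for a feature vector. The weights `w`, the offset `b`, and the two test points are hypothetical values chosen for illustration, not parameters from the paper; the tie case $\omega^{T}X + b = 0$, which (1) leaves undefined, is mapped to −1 here as an assumption.

```python
def classify(x, w, b):
    """Return +1 or -1 according to the sign of w^T x + b, as in (1)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    # (1) does not define the case score == 0; this sketch assigns it to -1.
    return 1 if score > 0 else -1

# Hypothetical trained parameters (illustration only).
w, b = [2.0, -1.0], 0.5

print(classify([1.0, 1.0], w, b))   # score = 1.5  -> 1
print(classify([0.0, 3.0], w, b))   # score = -2.5 -> -1
```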

Traditional SVM enforces a strict classification: the model is defined by the positive support vectors with $\omega^{T}X_i + b = 1$, the negative support vectors with $\omega^{T}X_i + b = -1$, and the SVM interface $\omega^{T}X_i + b = 0$. Since all positive instances must satisfy $\omega^{T}X_i + b \geq 1$ and all negative instances $\omega^{T}X_i + b \leq -1$, this leads to the following problems: (1) in many datasets, positive and negative instances are interlaced and cannot be separated under a regular kernel function; (2) meticulous training may cause overfitting, which fits the training set as closely as possible at the expense of performance on the data in the probe set; and (3) overtraining usually costs more computation. To address these shortcomings, C-SVM was introduced to improve the adaptability of the traditional SVM model [14, 15]. In the C-SVM model, the coefficient $C$ controls the tolerance of systematic outliers; a larger $C$ allows fewer outliers to exist in the opposite class. $C$ is an empirical parameter that is usually determined via a grid search. C-SVM uses a uniform $C$ for both positive and negative instances, which suits only datasets in which the classes have similar distributions. In LIBSVM, the C-SVM model is improved by using the ratio of the number of positive instances to the number of negative ones [13, 14]; however, the spatial distribution of the initial instances is not involved in the model training process. In this paper, we aim to provide a better choice of the parameter $C$, so that under the same conditions a more accurate classification result can be achieved.

2. Traditional Model of SVM

According to (1), since $Y$ is positive (or negative) when $\omega^{T}X_i + b > 0$ (or $\omega^{T}X_i + b < 0$), $|\omega^{T}X_i + b|$ can be written as $s = Y_i(\omega^{T}X_i + b)$, where $s$ is the distance between an arbitrary instance $(X_i, Y_i)$ and the SVM interface. When seeking the $\omega$ and $b$ that maximize the distance between the support vectors and the SVM interface, scaling $\omega$ and $b$ proportionally does not change the geometric distance. Thus, $s$ can be expressed as

$$s = Y_i\left(\left(\frac{\omega}{\lVert\omega\rVert}\right)^{T}X_i + \frac{b}{\lVert\omega\rVert}\right). \tag{2}$$

Normalizing the geometric interval to $\lVert\omega\rVert = 1$, (2) leads to the optimization problem

$$\max_{\omega,b}\ s \quad \text{s.t.}\quad Y_i(\omega^{T}X_i + b) \geq s,\ \ i = 1, 2, 3, \ldots, n,\qquad \lVert\omega\rVert = 1. \tag{3}$$

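Since the constraint $\lVert\omega\rVert = 1$ is not convex, the problem is restated as $\max_{\omega,b} s/\lVert\omega\rVert$; fixing $s = 1$, (3) becomes

$$\min_{\omega,b}\ \frac{1}{2}\lVert\omega\rVert^{2} \quad \text{s.t.}\quad Y_i(\omega^{T}X_i + b) \geq 1,\ \ i = 1, 2, 3, \ldots, n. \tag{4}$$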

To compute the minimum of $\lVert\omega\rVert^{2}$ under the condition $Y_i(\omega^{T}X_i + b) \geq 1$, the Lagrangian function is introduced as

$$L(\alpha,\omega,b) = \frac{1}{2}\lVert\omega\rVert^{2} + \sum_{i=1}^{n}\alpha_i\left(1 - Y_i(\omega^{T}X_i + b)\right). \tag{5}$$

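The minimum of $L(\alpha,\omega,b)$ can be acquired by differentiating with respect to the parameters $\omega$ and $b$ such that

$$\frac{\partial L(\alpha,\omega,b)}{\partial\omega} = \omega - \sum_{i=1}^{n}\alpha_i Y_i X_i = 0 \implies \omega = \sum_{i=1}^{n}\alpha_i Y_i X_i, \qquad \frac{\partial L(\alpha,\omega,b)}{\partial b} = \sum_{i=1}^{n}\alpha_i Y_i = 0. \tag{6}$$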

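Integrating (5) and (6), we finally obtain

$$\begin{aligned}
L(\alpha,\omega,b) &= \frac{1}{2}\omega^{T}\sum_{i=1}^{n}\alpha_i Y_i X_i + \sum_{i=1}^{n}\alpha_i - \sum_{i=1}^{n}\alpha_i Y_i\,\omega^{T}X_i - \sum_{i=1}^{n}\alpha_i Y_i b \\
&= -\frac{1}{2}\omega^{T}\sum_{i=1}^{n}\alpha_i Y_i X_i + \sum_{i=1}^{n}\alpha_i - b\sum_{i=1}^{n}\alpha_i Y_i \\
&= \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\left(\sum_{i=1}^{n}\alpha_i Y_i X_i\right)^{T}\sum_{i=1}^{n}\alpha_i Y_i X_i - b\sum_{i=1}^{n}\alpha_i Y_i \\
&= \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1,j=1}^{n}\alpha_i Y_i X_i^{T}\,\alpha_j Y_j X_j - b\sum_{i=1}^{n}\alpha_i Y_i \\
&= \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1,j=1}^{n}Y_i Y_j\alpha_i\alpha_j X_i^{T}X_j - b\sum_{i=1}^{n}\alpha_i Y_i.
\end{aligned} \tag{7}$$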

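Combining (6) and (7), where the second condition of (6) gives $\sum_{i=1}^{n}\alpha_i Y_i = 0$ and removes the last term, we get

$$L(\alpha,\omega,b) = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1,j=1}^{n}Y_i Y_j\alpha_i\alpha_j X_i^{T}X_j. \tag{8}$$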

In (8), the value of $L(\alpha,\omega,b)$ depends only on the parameter $\alpha$. The training of $\alpha$ can be carried out by the sequential minimal optimization (SMO) algorithm [16–18].
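The relations in (6) and (8) can be checked numerically on a toy problem. The sketch below, in plain Python, recovers $\omega$ from a given $\alpha$ and evaluates the dual objective (8); the two training points and the values of $\alpha$ are hypothetical and were picked so that both margins equal 1. It does not implement SMO.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def weights_from_alpha(alpha, X, Y):
    """omega = sum_i alpha_i * Y_i * X_i, the first condition in (6)."""
    dim = len(X[0])
    return [sum(a * y * x[k] for a, y, x in zip(alpha, Y, X)) for k in range(dim)]

def dual_objective(alpha, X, Y):
    """Value of (8): sum_i alpha_i - 1/2 sum_ij Y_i Y_j alpha_i alpha_j X_i^T X_j."""
    n = len(alpha)
    pair_sum = sum(
        Y[i] * Y[j] * alpha[i] * alpha[j] * dot(X[i], X[j])
        for i in range(n) for j in range(n)
    )
    return sum(alpha) - 0.5 * pair_sum

# Hypothetical toy data: one positive and one negative point on the first axis.
X = [[1.0, 0.0], [-1.0, 0.0]]
Y = [1, -1]
alpha = [0.5, 0.5]            # satisfies sum_i alpha_i * Y_i = 0, as required by (6)

omega = weights_from_alpha(alpha, X, Y)
print(omega)                          # [1.0, 0.0]
print(dual_objective(alpha, X, Y))    # 0.5, equal to 0.5 * ||omega||^2
```

For this toy input the dual value matches the primal objective of (4), $\frac{1}{2}\lVert\omega\rVert^{2} = 0.5$, which is what one expects at the optimum.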

3. C-SVM on the Penalty Parameter of the Error Term

The position of the SVM interface is determined by the distribution of the support vectors. This means that a slight change in the position of a single support vector could lead to an obvious movement of the SVM interface. Likewise, if a system outlier lies in the area of the opposite class, the SVM interface must bend to accommodate it, and it will no longer generate accurate classification results. Therefore, an error term is introduced to tolerate some erroneous instances in the opposite class. In the C-support vector machine (C-SVM) model [19, 20], we use a nonnegative parameter $\varsigma_i$, that is, a slack error term, which allows the geometric interval $s < 1$ between some erroneous instances and the SVM interface, according to (2). Having relaxed the restriction, we must rebuild the optimization problem to penalize the outliers:

$$\min_{\omega,b}\ \frac{1}{2}\lVert\omega\rVert^{2} + C\sum_{i=1}^{n}\varsigma_i \quad \text{s.t.}\quad Y_i(\omega^{T}X_i + b) \geq 1 - \varsigma_i,\quad \varsigma_i \geq 0,\quad i = 1, 2, 3, \ldots, n. \tag{9}$$

In (9), the coefficient $C$ is the penalty parameter of the error term, which controls the tolerance of systematic outliers. A larger $C$ allows fewer outliers to exist in the opposite class, and vice versa. Using the Lagrangian function to calculate the extremum of (9), (5) is rebuilt as

$$L(\alpha,\omega,b,\varsigma) = \frac{1}{2}\lVert\omega\rVert^{2} + \sum_{i=1}^{n}\alpha_i\left(1 - Y_i(\omega^{T}X_i + b) - \varsigma_i\right) + C\sum_{i=1}^{n}\varsigma_i - \sum_{i=1}^{n}\beta_i\varsigma_i. \tag{10}$$

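In (10), the parameters $\alpha_i$ and $\beta_i$ are the Lagrangian multipliers for the training instances and the systematic outliers, respectively. The extremum of $L(\alpha,\omega,b,\varsigma)$ can be acquired in correspondence with (6). Setting the derivative with respect to $\varsigma_i$ to zero additionally gives $C - \alpha_i - \beta_i = 0$, which together with $\beta_i \geq 0$ bounds $\alpha_i$ by $C$. Since $C$ and $\beta$ are not related to $\omega$ and $b$ in the SVM model, (9) reduces to the dual problem

$$\max_{\alpha}\ W(\alpha) = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1,j=1}^{n}Y_i Y_j\alpha_i\alpha_j\langle X_i, X_j\rangle \quad \text{s.t.}\quad 0 \leq \alpha_i \leq C,\ \ i = 1, 2, 3, \ldots, n,\qquad \sum_{i=1}^{n}\alpha_i Y_i = 0. \tag{11}$$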

When calculating (11) via the SMO process, one adjusts only two $\alpha_i$ at each iteration and treats the rest as constants until all Karush-Kuhn-Tucker (KKT) conditions are satisfied [21–23]:

$$\alpha_1 Y_1 + \alpha_2 Y_2 = -\sum_{i=3}^{n}\alpha_i Y_i = \xi. \tag{12}$$

The output $Y$ is labeled +1 or −1 for a positive or a negative instance. Thereby, when $Y_1 Y_2 = -1$, (12) can be regarded as a line with gradient 1 ($\alpha_1 - \alpha_2 = \xi$ or $\alpha_2 - \alpha_1 = \xi$). When $Y_1 Y_2 = 1$, it can be regarded as a line with gradient −1 ($\alpha_2 + \alpha_1 = \xi$ or $\alpha_1 + \alpha_2 = -\xi$). When adjusting $\alpha_1$ and $\alpha_2$, the values of the parameters must lie on the corresponding line according to Figure 1. Meanwhile, they must be restricted within the square with side length $C$, where $C$ is the penalty parameter of the error term in (9). Therefore, when $Y_1 Y_2 = -1$,

$$L = \max(0,\ \alpha_2 - \alpha_1), \qquad H = \min(C,\ C + \alpha_2 - \alpha_1); \tag{13}$$

otherwise,

$$L = \max(0,\ \alpha_2 + \alpha_1 - C), \qquad H = \min(C,\ \alpha_2 + \alpha_1). \tag{14}$$

Figure 1

Parameter training process of SMO. (a) Adjustment of the parameter when Y1Y2 = 1. (b) Adjustment of the parameter when Y1Y2 = −1.
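The bounds (13) and (14) translate directly into a small helper. This is an illustrative sketch; the function name `clip_bounds` and the sample multipliers are assumptions, not part of the paper or of LIBSVM.

```python
def clip_bounds(alpha1, alpha2, y1, y2, C):
    """Return (L, H), the feasible interval for the updated alpha2."""
    if y1 * y2 == -1:
        # Opposite labels, (13): the pair moves along a line of gradient 1.
        return max(0.0, alpha2 - alpha1), min(C, C + alpha2 - alpha1)
    # Same labels, (14): the pair moves along a line of gradient -1.
    return max(0.0, alpha2 + alpha1 - C), min(C, alpha2 + alpha1)

# Hypothetical multipliers inside the box [0, C] x [0, C] with C = 1.
print(clip_bounds(0.3, 0.8, 1, -1, 1.0))   # roughly (0.5, 1.0)
print(clip_bounds(0.3, 0.8, 1, 1, 1.0))    # roughly (0.1, 1.0)
```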

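Continuing the SMO process and setting $K_{ij} = \langle X_i, X_j\rangle$:

$$\alpha_2^{\text{new}} = \frac{Y_2\left(Y_2 - Y_1 + Y_1\xi(K_{11} - K_{12}) + v_1 - v_2\right)}{K_{11} + K_{22} - 2K_{12}}, \tag{15}$$

$$\alpha_2^{\text{new}} = \alpha_2^{\text{old}} + \frac{Y_2(E_1 - E_2)}{\eta}. \tag{16}$$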

In (16), $E_i = v_i - Y_i$ is the difference between the real-valued output of the model, $v_i = \omega^{T}X_i + b \in (-\infty, +\infty)$, and the label $Y_i \in \{+1, -1\}$. By the definition of $K_{ij}$, $\eta = K_{11} + K_{22} - 2K_{12}$ equals the square of the distance between the two vectors; that is, $\eta = \lVert X_1 - X_2\rVert^{2}$. Because the training vectors are fixed, $\eta$ is a constant in (16). The training of the Lagrangian multipliers $\alpha_1$ and $\alpha_2$ is calculated by (15) and (16), and it is limited by (13) and (14). Integrated with the KKT conditions, the final update of $\alpha_2$ is

$$\alpha_2^{\text{new,clipped}} = \begin{cases} L, & \alpha_2^{\text{new}} \leq L, \\ \alpha_2^{\text{new}}, & L < \alpha_2^{\text{new}} < H, \\ H, & \alpha_2^{\text{new}} \geq H, \end{cases} \tag{17}$$

where, from (12), $\alpha_1 = (\xi - \alpha_2 Y_2)Y_1$, so that $\alpha_1^{\text{old}} = Y_1\xi - Y_1 Y_2\alpha_2^{\text{old}}$ and $\alpha_1^{\text{new}} = Y_1\xi - Y_1 Y_2\alpha_2^{\text{new,clipped}}$. The update of $\alpha_1$ is therefore

$$\alpha_1^{\text{new}} = \alpha_1^{\text{old}} + Y_1 Y_2\left(\alpha_2^{\text{old}} - \alpha_2^{\text{new,clipped}}\right). \tag{18}$$
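One pass of the update rules (16)–(18) for a single pair can be sketched as follows. The function name, argument order, and sample numbers are hypothetical; the bounds `L` and `H` are passed in as computed from (13) or (14), and `eta` is assumed to be strictly positive.

```python
def smo_pair_update(alpha1, alpha2, y1, y2, E1, E2, eta, L, H):
    """Update one pair of multipliers and return (alpha1_new, alpha2_new)."""
    a2 = alpha2 + y2 * (E1 - E2) / eta        # unclipped step, (16)
    a2 = min(H, max(L, a2))                   # clip to [L, H], (17)
    a1 = alpha1 + y1 * y2 * (alpha2 - a2)     # keep alpha1*y1 + alpha2*y2 fixed, (18)
    return a1, a2

# Hypothetical pair with opposite labels and C = 1, so L = 0.5 and H = 1.0 by (13).
inside = smo_pair_update(0.3, 0.8, 1, -1, E1=0.1, E2=-0.1, eta=2.0, L=0.5, H=1.0)
clipped = smo_pair_update(0.3, 0.8, 1, -1, E1=-1.0, E2=1.0, eta=2.0, L=0.5, H=1.0)
print(inside)    # approximately (0.2, 0.7): the step stays inside [L, H]
print(clipped)   # approximately (0.5, 1.0): the step is clipped at H
```

In both cases the quantity $\alpha_1 Y_1 + \alpha_2 Y_2$ from (12) is unchanged (here −0.5), which is the property that (18) is designed to preserve.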

The training process stops when all $\alpha_i$ values satisfy the KKT conditions:

$$\alpha_i = 0 \iff Y_i v_i \geq 1, \qquad 0 < \alpha_i < C \iff Y_i v_i = 1, \qquad \alpha_i = C \iff Y_i v_i \leq 1. \tag{19}$$
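A stopping test based on (19) might look like the sketch below. The numerical tolerance `tol` is an assumption introduced here, since floating-point training rarely satisfies the equalities exactly; the paper does not specify one.

```python
def satisfies_kkt(alpha, y, v, C, tol=1e-3):
    """Check the conditions in (19) for one instance with model output v."""
    margin = y * v
    if alpha <= tol:                  # alpha_i = 0: on or outside the margin
        return margin >= 1 - tol
    if alpha >= C - tol:              # alpha_i = C: on or inside the margin
        return margin <= 1 + tol
    return abs(margin - 1) <= tol     # 0 < alpha_i < C: exactly on the margin

def all_kkt_satisfied(alphas, Y, V, C, tol=1e-3):
    """Training can stop once every instance passes the check."""
    return all(satisfies_kkt(a, y, v, C, tol) for a, y, v in zip(alphas, Y, V))

# Hypothetical multipliers, labels, and model outputs.
print(all_kkt_satisfied([0.0, 0.5, 1.0], [1, -1, 1], [2.5, -1.0, 0.4], C=1.0))  # True
print(satisfies_kkt(0.0, 1, 0.2, C=1.0))                                        # False
```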

4. Optimization of the Penalty Parameter of the Error Term

Since there is no theoretical rule for selecting the penalty parameter of the error term, a grid search on the value of $C$ using cross-validation is recommended (e.g., $C = 2^{-5}, 2^{-3}, \ldots, 2^{15}$). Once the appropriate $C$ is determined, the same value is applied to both positive and negative instances.
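The candidate values for such a grid search can be generated as below. This is only a sketch: `toy_score` is a hypothetical stand-in for a cross-validated accuracy estimate, which in practice would train and evaluate a C-SVM for each candidate.

```python
import math

def candidate_grid(low=-5, high=15, step=2):
    """Powers of two from 2**low to 2**high with the given exponent step."""
    return [2.0 ** k for k in range(low, high + 1, step)]

def grid_search(score, candidates):
    """Return the candidate C with the highest score."""
    return max(candidates, key=score)

def toy_score(C):
    # Hypothetical score that peaks at C = 8; replace with cross-validation.
    return -abs(math.log2(C) - 3.0)

grid = candidate_grid()
best_C = grid_search(toy_score, grid)
print(len(grid), grid[0], grid[-1])   # 11 0.03125 32768.0
print(best_C)                         # 8.0
```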

Hypothesis 1.

In Figure 2, the red dots represent positive instances, and the blue diamonds represent negative instances. Suppose that four instances (two positive and two negative) are outliers, circled by the black ellipses. Since there is a large number of positive instances, removing two of them from the support vectors may not change the position of the SVM interface. The same does not hold for the negative instances: removing two negative support vectors produces an obvious change in the position of the SVM interface. Thus, an unknown instance represented by the black dot will be erroneously classified into the positive set, although it would have belonged to the negative set if C were not implemented in the SVM model.

Figure 2

A heterogeneous distribution of the initial instances for C-SVM, Hypothesis 1: the number of the positive instances is much greater than the number of negative instances.

According to the analysis above, we provide different values of $C$ for positive instances and negative instances instead of a constant penalty parameter for all instances. Thus, (9) can be improved to

$$\min_{\omega,b}\ \frac{1}{2}\lVert\omega\rVert^{2} + C_{+}\sum_{i=1}^{l}\varsigma_i + C_{-}\sum_{i=l+1}^{l+m}\varsigma_i \quad \text{s.t.}\quad Y_i(\omega^{T}X_i + b) \geq 1 - \varsigma_i,\quad \varsigma_i \geq 0,\quad i = 1, 2, 3, \ldots, n. \tag{20}$$

In (20), $l$ is the number of positive instances and $m$ is the number of negative instances, so that $n = l + m$. Since the positive instances in Hypothesis 1 can tolerate more system outliers due to their large number, $C_{+}$ can be assigned a smaller value than $C_{-}$.
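The effect of separate penalties in (20) can be illustrated by evaluating the objective for a fixed interface. In this sketch each slack is taken as the smallest value that satisfies the constraints, $\varsigma_i = \max(0,\ 1 - Y_i(\omega^{T}X_i + b))$; the data, the interface, and the penalty values are hypothetical.

```python
def slack(x, y, w, b):
    """Smallest feasible slack for one instance under the constraints of (20)."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    return max(0.0, 1.0 - margin)

def weighted_objective(X, Y, w, b, C_pos, C_neg):
    """Objective of (20) with class-specific penalties C_pos and C_neg."""
    reg = 0.5 * sum(wi * wi for wi in w)
    pos = sum(slack(x, y, w, b) for x, y in zip(X, Y) if y == 1)
    neg = sum(slack(x, y, w, b) for x, y in zip(X, Y) if y == -1)
    return reg + C_pos * pos + C_neg * neg

# Hypothetical data: one outlier per class lies on the wrong side of the interface.
X = [[2.0, 0.0], [-0.5, 0.0], [-2.0, 0.0], [0.5, 0.0]]
Y = [1, 1, -1, -1]
w, b = [1.0, 0.0], 0.0

print(weighted_objective(X, Y, w, b, C_pos=1.0, C_neg=1.0))   # 3.5
print(weighted_objective(X, Y, w, b, C_pos=1.0, C_neg=4.0))   # 8.0
```

With $C_{+} = C_{-}$ the function reduces to the objective of (9); raising only $C_{-}$ makes the negative outlier four times as costly while leaving the positive one unchanged.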

Hypothesis 2.

In Figure 3, the number of positive instances is equal to the number of negative instances, but the negative instances can tolerate more system outliers due to the initial distribution of the data.

Figure 3

Heterogeneous distribution of the initial instances for C-SVM, Hypothesis 2: when the number of positive instances is equal to the number of negative instances, the positive instances account for a larger area.

Hypothesis 3.

In Figure 4, the number of positive instances is even larger than the number of negative instances, but the penalty parameter for the positive instances can be stricter than for the negative instances. Therefore, C + must be assigned a larger value in Hypothesis 2 than in Hypothesis 3.

Figure 4

Heterogeneous distribution of the initial instances for C-SVM, Hypothesis 3: when the number of positive instances is larger than the number of negative instances, the positive instances account for a larger area.

Taking all the hypotheses together, we find that the ratio of $C_{+}$ to $C_{-}$ depends on the numbers of positive and negative instances and on the distribution of the initial data samples. Therefore, we propose a density-based penalty parameter optimization of the error term:

$$D_{+} = \frac{\left|\max_{i:\,Y_i = +1}\dfrac{\omega^{T}X_i + b}{\lVert\omega\rVert} - \min_{i:\,Y_i = +1}\dfrac{\omega^{T}X_i + b}{\lVert\omega\rVert}\right|}{l}, \qquad D_{-} = \frac{\left|\min_{i:\,Y_i = -1}\dfrac{\omega^{T}X_i + b}{\lVert\omega\rVert} - \max_{i:\,Y_i = -1}\dfrac{\omega^{T}X_i + b}{\lVert\omega\rVert}\right|}{m}. \tag{21}$$

In (21), $D_{+}$ and $D_{-}$ characterize the sample density of the positive instances and the negative instances, respectively. The larger the value of $D$ is, the smaller the sample density is, and thus a larger $C$ should be assigned to that class, in line with Hypotheses 2 and 3 and with the ratio $C_{+}/C_{-} = D_{+}/D_{-}$ used in (22). According to Figure 5, the density of a class is given by the distance between its node remotest from the SVM interface and its node nearest to the SVM interface, divided by the number of instances in the class.
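A direct reading of (21) in code is given below as a sketch. The interface parameters `w` and `b` and the sample points are hypothetical; in the paper's setting they would come from a first C-SVM training pass.

```python
import math

def signed_distance(x, w, b):
    """(w^T x + b) / ||w||, the signed distance of x from the interface."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def class_spread(X, Y, w, b, label):
    """D for one class as in (21): distance range divided by class size."""
    d = [signed_distance(x, w, b) for x, y in zip(X, Y) if y == label]
    return abs(max(d) - min(d)) / len(d)

# Hypothetical data: four positive and two negative points along the first axis.
X = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [5.0, 0.0], [-1.0, 0.0], [-2.0, 0.0]]
Y = [1, 1, 1, 1, -1, -1]
w, b = [1.0, 0.0], 0.0

D_pos = class_spread(X, Y, w, b, 1)
D_neg = class_spread(X, Y, w, b, -1)
print(D_pos, D_neg)   # 1.0 0.5
```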

Figure 5

Density-based, parameter-weights optimization on the distribution of a heterogeneous dataset for C-SVM.

5. Experiments

We chose datasets from the official website of LIBSVM, which hosts many classification, regression, and multilabel datasets stored in LIBSVM format. Many are from UCI, Statlog, StatLib, and other collections [24]. The data groups used in our experiments are listed in Table 1.

Table 1

Standard dataset for classification.

Name	Type	Class	Training size	Testing size	Feature
a1a	Classification	2	1,605	30,956	123
a2a	Classification	2	2,265	30,296	123
a3a	Classification	2	3,185	29,376	123
a4a	Classification	2	4,781	27,780	123
w1a	Classification	2	2,477	47,272	300
w2a	Classification	2	3,470	46,279	300
w3a	Classification	2	4,912	44,837	300
w4a	Classification	2	7,366	42,383	300

In order to evaluate the accuracy of our proposed algorithm, we optimized the C-SVM model based on the LIBSVM tools [14] using the linear kernel function $\omega^{T}X + b$. The comparative tests use (1) a uniform $C$ for both classes, as in traditional C-SVM, (2) the $C_{+}$ and $C_{-}$ that correspond to the ratio of the number of positive instances to the number of negative instances, and (3) the $C_{+}$ and $C_{-}$ that correspond to our proposed density-based penalty parameter optimization.

To test whether the proposed algorithm performs well under all circumstances, we simply assigned $C$ the values 0.5, 1, 10, 50, and 100 instead of performing a grid search. In our proposed optimization, (21) provides only the ratio of $C_{+}$ to $C_{-}$, not their exact values. Therefore, we used

$$C_{+} = C + \Delta, \qquad C_{-} = C - \Delta \quad (C = 0.5, 1, \ldots, 100), \qquad \frac{C_{+}}{C_{-}} = \frac{D_{+}}{D_{-}}. \tag{22}$$

For comparative test 2, the ratio of $C_{+}$ to $C_{-}$ was decided by the number of positive instances $N_{+}$ and the number of negative instances $N_{-}$:

$$C_{+} = C + \Delta, \qquad C_{-} = C - \Delta \quad (C = 0.5, 1, \ldots, 100), \qquad \frac{C_{+}}{C_{-}} = \frac{N_{-}}{N_{+}}. \tag{23}$$

In our proposed algorithm, the SVM interface is unknown before the classification. In order to calculate the density of each class, we first train a traditional C-SVM and determine the position of the SVM interface. In this way, the densities of the positive and negative instances can be computed via (21), and then $C_{+}$ and $C_{-}$ can be determined by (22).
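Given a base value $C$ and a target ratio $r = C_{+}/C_{-}$, the two conditions $C_{+} = C + \Delta$ and $C_{-} = C - \Delta$ can be solved for $\Delta = C(r - 1)/(r + 1)$. The sketch below applies this closed form, which is derived here and not stated in the paper; the function name and the sample densities are hypothetical.

```python
def split_penalty(C, ratio):
    """Return (C_pos, C_neg) with C_pos / C_neg == ratio and mean equal to C."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    delta = C * (ratio - 1.0) / (ratio + 1.0)
    return C + delta, C - delta

# Hypothetical densities, for example the output of a first training pass.
D_pos, D_neg = 1.0, 0.5
C_pos, C_neg = split_penalty(10.0, D_pos / D_neg)
print(round(C_pos, 4), round(C_neg, 4))   # 13.3333 6.6667
```

The same helper covers the number-proportion variant by passing $N_{-}/N_{+}$ as the ratio instead of $D_{+}/D_{-}$.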

We evaluated the accuracy of our proposed algorithm via precision, recall, and F-measure. The precision rate was the number of correctly classified instances divided by the total number of instances. Table 2 shows the precision rates of the different algorithms for different $C$ values, where the suffixes -1, -2, and -3 (e.g., a1a-1, a1a-2, and a1a-3) denote the traditional C-SVM, the improved C-SVM based on number proportion (23), and our proposed density-based C-SVM (22), respectively.

Table 2

Systematic precision at different parameter values (%).

Model	C = 0.5	C = 1	C = 10	C = 50	C = 100
a1a-1	84.03	83.82	83.77	83.69	83.64
a1a-2	77.41	78.65	78.56	78.65	78.72
a1a-3	83.92	83.31	83.26	83.11	83.10
a2a-1	84.60	84.28	84.02	83.98	83.95
a2a-2	76.89	77.27	77.26	77.19	77.20
a2a-3	84.47	84.30	84.04	83.92	83.88
a3a-1	84.50	84.32	84.08	84.07	84.07
a3a-2	77.53	77.88	77.89	77.83	77.80
a3a-3	84.37	84.35	84.11	84.08	84.07
a4a-1	84.29	84.25	84.18	84.06	84.07
a4a-2	78.49	79.22	79.13	79.20	79.16
a4a-3	83.96	84.08	84.12	84.23	84.11
w1a-1	97.56	97.74	97.46	96.75	96.56
w1a-2	95.10	96.10	94.94	94.25	94.66
w1a-3	97.62	97.84	97.53	97.34	97.21
w2a-1	97.86	98.07	97.53	97.27	97.07
w2a-2	94.88	96.03	95.78	94.86	94.90
w2a-3	98.12	98.21	97.92	97.63	97.58
w3a-1	97.83	98.29	98.02	97.84	97.83
w3a-2	95.27	96.22	96.01	95.51	95.62
w3a-3	97.91	98.24	98.36	98.07	98.02
w4a-1	98.01	98.39	98.26	98.07	97.95
w4a-2	95.58	96.52	96.50	96.15	96.22
w4a-3	98.01	98.42	98.38	98.27	98.26

The recall rate is the number of correctly classified positive instances divided by the total number of positive instances in the testing set. Table 3 shows the recall rates of the different algorithms for different values of $C$.

Table 3

Systematic recall for different parameter values (%).

Model	C = 0.5	C = 1	C = 10	C = 50	C = 100
a1a-1	60.07	60.60	61.46	61.50	61.46
a1a-2	87.21	85.87	85.14	84.99	84.97
a1a-3	73.02	72.73	72.16	71.54	71.52
a2a-1	59.17	62.04	62.29	62.61	62.77
a2a-2	88.10	87.54	86.34	86.22	86.19
a2a-3	79.37	79.22	78.94	78.73	78.56
a3a-1	57.64	60.58	60.70	60.70	60.71
a3a-2	87.28	86.39	85.84	85.67	85.75
a3a-3	78.93	79.01	78.54	78.33	78.28
a4a-1	58.39	60.20	60.90	60.90	60.92
a4a-2	86.61	85.69	85.57	85.60	85.65
a4a-3	80.01	79.56	79.34	79.42	79.38
w1a-1	19.97	50.39	49.82	45.98	46.41
w1a-2	66.95	59.51	50.82	47.46	47.69
w1a-3	70.42	70.13	68.24	67.03	67.12
w2a-1	31.41	51.24	56.27	57.14	56.49
w2a-2	73.03	70.12	65.01	61.52	60.64
w2a-3	72.76	72.53	71.88	72.02	70.93
w3a-1	29.42	53.97	58.01	56.59	58.23
w3a-2	77.47	73.05	68.34	65.64	63.47
w3a-3	77.12	76.87	76.31	75.54	75.65
w4a-1	36.18	55.50	60.73	61.84	62.07
w4a-2	77.28	73.71	70.31	68.17	67.46
w4a-3	77.37	76.52	76.43	75.12	74.83

Table 1 indicates that the testing set of w1a contains 47,272 instances, of which 1,407 are positive and 45,865 are negative. For this distribution, suppose that we predict all unknown inputs as negative instances. All 45,865 negative instances are then classified correctly, giving a precision of 97.02% even though no positive instance is found. Therefore, the recall rate is of great importance as a supplementary measure. In the a1a, a2a, a3a, and a4a datasets, the number of negative instances is about double that of the positive instances. Method 2 (number proportion-based optimization) sacrifices precision within an acceptable range but improves recall substantially. With Method 3 (our proposed density-based C-SVM), 12 of the 20 experiments showed slightly lower precision, while the other eight improved it. All 20 experiments with Method 3 improved the recall rate, but not to the extent that Method 2 did.
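The arithmetic behind the all-negative baseline can be reproduced in a few lines. The counts are the ones quoted in the text for the w1a testing set; precision follows the paper's definition (correct predictions divided by all predictions), and the variable names are illustrative.

```python
positives, negatives = 1407, 45865       # w1a testing set, as quoted in the text
total = positives + negatives            # 47,272 instances, matching Table 1

# Baseline: predict "negative" for every input.
correct = negatives                      # every negative is right, every positive is wrong
precision = correct / total              # the paper's precision: correct / total
recall = 0 / positives                   # no positive instance is recovered

print(f"precision = {100 * precision:.2f}%")   # precision = 97.02%
print(f"recall    = {100 * recall:.2f}%")      # recall    = 0.00%
```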

In the w1a, w2a, w3a, and w4a datasets, the number of negative instances was many times greater than that of the positive instances. Here our proposed method showed clear advantages in both precision rate and recall rate. Traditional C-SVM has a high precision rate, but it performs poorly with respect to recall rate. Method 2 improved the recall performance and decreased the precision rate, as in the a1a to a4a experiments. Method 3 achieved a higher precision rate than traditional C-SVM in most settings, and it simultaneously improved the recall rate over that of Method 2 in most settings.

The F-measure is a comprehensive evaluation of both precision and recall. In (24), $\beta$ is the parameter that adjusts the weights between the precision rate and the recall rate. When precision is considered more important, the value of $\beta$ should be < 1. Conversely, in some cases, such as alarming or warning, the recall rate is significant in determining all of the potential risks, and the value of $\beta$ should be > 1, because the F-measure in (24) tends to REC as $\beta$ grows and to PRE as $\beta$ approaches 0:

$$F\text{-measure} = \frac{(\beta^{2} + 1)\times \mathrm{PRE}\times \mathrm{REC}}{\beta^{2}\times \mathrm{PRE} + \mathrm{REC}}. \tag{24}$$
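Formula (24) is short enough to check against the reported numbers. The sketch below uses the a1a-1 and w1a-3 entries at $C = 0.5$ from Tables 2 and 3 as inputs and compares the results with the corresponding entries of Table 4; the function name is an assumption.

```python
def f_measure(pre, rec, beta=1.0):
    """F-measure of (24) for precision `pre` and recall `rec` (same units)."""
    b2 = beta * beta
    return (b2 + 1.0) * pre * rec / (b2 * pre + rec)

# Precision and recall in percent, taken from Tables 2 and 3 at C = 0.5.
print(f"{f_measure(84.03, 60.07):.2f}")   # 70.06, the a1a-1 entry of Table 4
print(f"{f_measure(97.62, 70.42):.2f}")   # 81.82, the w1a-3 entry of Table 4
```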

Table 4 provides the evaluation results by F-measure with $\beta = 1$. Figures 6 and 7 show the comparisons among M-1 (traditional C-SVM), M-2 (number proportion-based C-SVM optimization), and M-3 (density-based C-SVM optimization). Each value was obtained by averaging the results of one data group over C = 0.5, 1, 10, 50, and 100.

Table 4

Systematic F-measure at different parameter values (%).

Model	C = 0.5	C = 1	C = 10	C = 50	C = 100
a1a-1	70.06	70.34	70.90	70.90	70.85
a1a-2	82.02	82.10	81.72	81.70	81.73
a1a-3	78.09	77.66	77.31	76.89	76.88
a2a-1	69.63	71.47	71.54	71.74	71.84
a2a-2	82.12	82.09	81.55	81.45	81.44
a2a-3	81.84	81.68	81.41	81.24	81.13
a3a-1	68.53	70.51	70.50	70.50	70.51
a3a-2	82.11	81.91	81.67	81.56	81.58
a3a-3	81.56	81.59	81.23	81.10	81.07
a4a-1	69.09	70.29	70.78	70.77	70.79
a4a-2	82.35	82.33	82.22	82.28	82.27
a4a-3	81.94	81.76	81.66	81.75	81.68
w1a-1	33.16	66.50	65.94	62.34	62.70
w1a-2	78.58	73.81	66.20	63.40	63.43
w1a-3	81.82	81.70	80.30	79.39	79.41
w2a-1	47.56	67.31	71.37	71.99	71.42
w2a-2	82.53	81.05	77.45	74.63	74.00
w2a-3	83.56	83.44	82.90	82.89	82.15
w3a-1	45.23	69.68	72.88	71.70	73.01
w3a-2	85.45	83.05	79.85	77.81	76.30
w3a-3	86.28	86.25	85.94	85.34	85.39
w4a-1	52.85	70.97	75.06	75.85	75.99
w4a-2	85.46	83.59	81.35	79.78	79.31
w4a-3	86.48	86.10	86.03	85.15	84.96

Figure 6

Comparison of the F -measure among traditional C-SVM (M-1), number proportion-based optimization (M-2), and density-based optimization (M-3) via datasets in which the size of the negative instances was several times greater than that of the positive instances.

Figure 7

Figure 6 shows datasets a1a, a2a, a3a, and a4a, in which the number of negative instances was several times greater than that of the positive instances. Both M-2 and M-3 yield better F-measure values than M-1, the traditional C-SVM. M-2 achieves the highest F-measure, but it does so by sacrificing systematic precision in order to achieve better recall. Our proposed M-3 minimizes the loss of systematic precision while clearly improving the F-measure over M-1.

Figure 7 shows datasets w1a, w2a, w3a, and w4a, in which the number of negative instances is far greater than that of the positive instances. M-3 had the best results for precision, recall, and F-measure. Therefore, for this data distribution, our proposed density-based C-SVM optimization provided a clear advantage for the classification of data.

6. Conclusions

In this paper, we presented density-based penalty parameter optimization for the C-SVM algorithm. In traditional C-SVM, the penalty parameter of the error term, C, controls the tolerance of systematic outliers. A larger value of C allows fewer outliers to exist in the opposite class. A grid search is generally used to determine the value of C. In order to enhance the accuracy of the algorithm, LIBSVM sets different values of C for the positive and negative slack error terms based on the ratio of the numbers of positive and negative instances. The principle of number proportion-based C-SVM optimization is that the weight of each instance is decided by the possibility that the instance itself is a system outlier and by the extent to which it would change the position of the SVM interface. Motivated by this idea, our proposed density-based penalty parameter optimization is a more integrated consideration that includes the sizes of the positive and negative classes and also takes the distribution of their instances into account. We conducted our experiments on standard classification datasets. The evaluation results indicated that number proportion-based C-SVM optimization normally achieves a better F-measure, but it does so by greatly enhancing the systematic recall while simultaneously decreasing the systematic precision. Compared with number proportion-based C-SVM optimization, our proposed density-based method improved the systematic recall and maintained the systematic precision of traditional C-SVM. Our proposed density-based method performed well on both precision and recall, especially for datasets in which the number of negative instances was far greater than the number of positive instances.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research was supported by the National Natural Science Foundation under Grant 61371071, Beijing Natural Science Foundation under Grant 4132057, Beijing Science and Technology Program under Grant Z121100007612003, and the Academic Discipline and Postgraduate Education Project of the Beijing Municipal Commission of Education.

References

[1] Hosmer, J. R.; Jr. David; Stanley; Rodney. Applied Logistic Regression. John Wiley & Sons, New York, NY, USA, 2013.

[2] Darroch, J. N.; Ratcliff. Generalized iterative scaling for log-linear models. Annals of Mathematical Statistics, 1972, 43, 1470–1480. doi:10.1214/aoms/1177692379. MR0345337. Zbl 0251.62020.

[3] Lin, C.-J.; Weng, R. C.; Keerthi, S. S. Trust region Newton methods for large-scale logistic regression. Proceedings of the 24th International Conference on Machine Learning (ICML '07), Corvallis, Ore, USA, June 2007, 561–568. doi:10.1145/1273496.1273567. 2-s2.0-34547982357.

[4] Lin, C. J.; J. J. Newton's method for large bound-constrained optimization problems. SIAM Journal on Optimization, 1999, 9(4), 1100–1127. doi:10.1137/S1052623498345075. MR1724778. 2-s2.0-0033436056.

[5] Mangasarian, O. L. A finite Newton method for classification. Optimization Methods & Software, 2002, 17(5), 913–929. doi:10.1080/1055678021000028375. MR1953825. 2-s2.0-0036817951.

[6] Nash, S. G. A survey of truncated-Newton methods. Journal of Computational and Applied Mathematics, 2000, 124(1-2), 45–59. doi:10.1016/S0377-0427(00)00426-X. MR1803293. Zbl 0969.65054. 2-s2.0-0034538773.

[7] Kao, W. C.; Chung, K. M.; Sun, C. L.; Lin, C. J. Decomposition methods for linear support vector machines. Neural Computation, 2004, 16(8), 1689–1704. doi:10.1162/089976604774201640. 2-s2.0-3042616550.

[8] Joachims. Training linear SVMs in linear time. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '06), August 2006, 217–226. 2-s2.0-33749563073.

[9] Schölkopf; Burges, C. J. C.; Smola, A. J. Advances in Kernel Methods: Support Vector Learning. MIT Press, Cambridge, Mass, USA, 1998.

[10] Shalev-Shwartz; Singer; Srebro. Pegasos: primal estimated sub-Gradient solver for SVM. Proceedings of the 24th International Conference on Machine Learning (ICML '07), Corvallis, Ore, USA, June 2007, 807–814. doi:10.1145/1273496.1273598. 2-s2.0-34547964973.

[11] Boser, B. E.; Guyon, I. M.; Vapnik, V. N. Training algorithm for optimal margin classifiers. Proceedings of the 5th Annual Workshop on Computational Learning Theory, July 1992, 144–152. 2-s2.0-0026966646.

[12] Hsieh, C. J.; Chang, K. W.; Lin, C. J.; Keerthi, S. S.; Sundararajan. A dual coordinate descent method for large-scale linear SVM. Proceedings of the 25th International Conference on Machine Learning, July 2008, 408–415. 2-s2.0-56449086680.

[13] Fan, R.-E.; Chang, K.-W.; Hsieh, C.-J.; Wang, X.-R.; Lin, C.-J. LIBLINEAR: a library for large linear classification. Journal of Machine Learning Research, 2008, 9, 1871–1874. 2-s2.0-50949133669.

[14] Chang, C. C.; Lin, C. J. LIBSVM: a library for support vector machines. 2001, http://www.csie.ntu.edu.tw/~cjlin/libsvm/

[15] Guan, X. H.; Zan. Network intrusion detection based on support vector machine. Journal of Computer Research and Development, 2003, 6, 799–807.

[16] Platt. Sequential minimal optimization: a fast algorithm for training support vector machines. Advances in Kernel Methods: Support Vector Learning, 1998.

[17] Platt, J. C. Fast Training of Support Vector Machines using Sequential Minimal Optimization. MIT Press, Cambridge, Mass, USA, 1999.

[18] Cao, L. J.; Keerthi, S. S.; Ong, C. J.; Zhang, J. Q.; Periyathamby; X. J.; Lee, H. P. Parallel sequential minimal optimization for the training of support vector machines. IEEE Transactions on Neural Networks, 2006, 17(4), 1039–1049. doi:10.1109/TNN.2006.875989. 2-s2.0-33746869623.

[19] Guan, X. H.; Zan; Han, C. Z. Network intrusion detection based on support vector machine. Journal of Computer Research and Development, 2003, 6, 799–807.

[20] Liu; Jia, C. Y. A new weighted support vector machine with GA-based parameter selection. Proceedings of the IEEE International Conference on Machine Learning and Cybernetics, 2005.

[21] Kuhn. The Karush-Kuhn-Tucker Theorem. CDSEM Uni Mannheim, 2006.

[22] Jiang. Semismooth Karush-Kuhn-Tucker equations and convergence analysis of Newton and quasi-Newton methods for solving these equations. Mathematics of Operations Research, 1997, 22(2), 301–325. doi:10.1287/moor.22.2.301. MR1450794. 2-s2.0-0031140591.

[23] Bach, F. R.; Lanckriet, G. R. G.; Jordan, M. I. Multiple kernel learning, conic duality, and the SMO algorithm. Proceedings of the 21st International Conference on Machine Learning (ICML '04), July 2004, 41–48. 2-s2.0-14344252374.

[24] Hsu, C. W.; Chang, C. C.; Lin, C. J. A practical guide to support vector classification. 2003, https://www.cs.sfu.ca/people/Faculty/teaching/726/spring11/svmguide.pdf