Cost-Sensitive Support Vector Machine Using Randomized Dual Coordinate Descent Method for Big Class-Imbalanced Data Classification

Cost-sensitive support vector machines are among the most popular tools for dealing with class-imbalanced problems such as fault diagnosis. However, such data often appear with a huge number of examples as well as features. Aiming at the class-imbalanced problem on big data, a cost-sensitive support vector machine using a randomized dual coordinate descent method (CSVM-RDCD) is proposed in this paper. The solution of the concerned subproblem at each iteration is derived in closed form, and the computational cost is decreased through an accelerating strategy and cheap computation. The four constrained conditions of CSVM-RDCD are derived. Experimental results illustrate that the proposed method increases the recognition rate of the positive class and reduces the average misclassification cost on real big class-imbalanced data.


Introduction
The most popular strategy for the design of classification algorithms is to minimize the probability of error, assuming that all misclassifications have the same cost and that the classes of the dataset are balanced [1][2][3][4][5][6]. The resulting decision rules are usually denoted as cost-insensitive. However, in many important applications of machine learning, such as fault diagnosis [7] and fraud detection, certain types of error are much more costly than others. Other applications involve significantly class-imbalanced datasets, where examples from different classes appear with substantially different probability. The cost-sensitive support vector machine (CSVM) [2] is one of the most popular tools to deal with class imbalance and unequal misclassification costs. However, in many applications, such data appear with a huge number of examples as well as features.
In this work we consider the cost-sensitive support vector machine architecture [1]. Although CSVMs are based on a very solid learning-theoretic foundation and have been successfully applied to many classification problems, it is not well understood how to design big data learning for the CSVM algorithm. A CSVM usually maps training vectors into a high-dimensional space via a nonlinear function. Due to the high dimensionality of the weight vector, one solves the dual problem of the CSVM by the kernel trick. In some applications, data already appear in a rich-dimensional feature space, and the performance is similar with or without a nonlinear mapping. If the data are not mapped, we can often train on much larger datasets.
For updating $\alpha_{t,i}$ to $\alpha_{t+1,i}$, the following one-variable subproblem is solved:
$$\min_{d} f(\boldsymbol{\alpha}_t + d\, e_i),$$
where $e_i = [0, \ldots, 1, \ldots, 0]^T$ has a 1 in the $i$th position. This is the general process of a one-variable coordinate descent method. However, the optimization problem (13) carries the sum constraint
$$\sum_{k=1}^{l} y_k \alpha_k = 0, \qquad (14)$$
so $\alpha_i$ is exactly determined by the other $\alpha_k$'s: if we were to hold $\{\alpha_k\}_{k=1}^{l} \setminus \alpha_i$ fixed, we could not make any change to $\alpha_i$ without violating the constraint condition (14) of the optimization problem.
Thus, if we want to update some subset of the $\alpha_k$'s, we must update at least two of them simultaneously in order to keep satisfying the constraints. Suppose the current $\alpha_k$'s satisfy the constraint conditions (14), and suppose we decide to hold $\{\alpha_k\}_{k=1}^{l} \setminus \{\alpha_i, \alpha_j\}$ fixed and reoptimize the dual problem of the CSVM (10) with respect to $\alpha_i, \alpha_j$ (subject to the constraints). Equation (6) is then rewritten as
$$y_i \alpha_i + y_j \alpha_j = -\sum_{k \neq i,j} y_k \alpha_k.$$
Since the right-hand side is fixed, we can just denote it by some constant $\zeta$:
$$y_i \alpha_i + y_j \alpha_j = \zeta. \qquad (17)$$
Writing the coordinates before and after the update as $(\alpha_{t,i}, \alpha_{t,j})$ and $(\alpha_{t+1,i}, \alpha_{t+1,j})$, consider the following dual two-variable subproblem of the 2C-SVM:
$$\min_{d_i, d_j} f(\boldsymbol{\alpha}_t + d_i e_i + d_j e_j), \qquad (18)$$
where $\zeta$ is constant and $\boldsymbol{\alpha}_t = [\alpha_{t,1}, \ldots, \alpha_{t,l}]^T$. From (18), we obtain the partial derivatives $\partial f(\boldsymbol{\alpha})/\partial \alpha_i$, the $i$th components of the gradient $\nabla f(\boldsymbol{\alpha})$.
By incorporating (21), (18) is rewritten in terms of a single variable. From (17), we have $\alpha_j = y_j(\zeta - y_i \alpha_i)$, and equation (22) is reformed accordingly. Treating $(\alpha_{t,i} + \alpha_{t,j})$, $(\partial f(\boldsymbol{\alpha})/\partial \alpha_i - \partial f(\boldsymbol{\alpha})/\partial \alpha_j)$, and $\zeta$ as constants, one can verify that this is just a quadratic function in $\alpha_i$, which is easily optimized by setting its derivative to zero. Along the feasible direction $u = y_i e_i - y_j e_j$, which preserves (17), the closed form is derived as follows:
$$t^{*} = -\frac{y_i \nabla_i f(\boldsymbol{\alpha}) - y_j \nabla_j f(\boldsymbol{\alpha})}{Q_{ii} + Q_{jj} - 2 y_i y_j Q_{ij}}, \qquad \alpha_{t+1,i} = \alpha_{t,i} + y_i t^{*}, \quad \alpha_{t+1,j} = \alpha_{t,j} - y_j t^{*},$$
where $Q_{ij} = y_i y_j x_i^T x_j$, so that the denominator equals $\|x_i - x_j\|^2$ for the linear kernel. We now consider the box constraints (10) of the two-variable optimization problem. The box constraints $0 \le \alpha_i \le C\gamma$ for $i \in I_{+}$ and $0 \le \alpha_i \le C(1-\gamma)$ for $i \in I_{-}$ split into four cases according to the labels of the examples at the two chosen coordinates.
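The closed-form pair update derived above can be sketched in a few lines; a minimal illustration with a linear kernel, using hypothetical helper names (the paper's full algorithm additionally clips to the box constraints and caches inner products):

```python
import numpy as np

def pair_update(alpha, i, j, X, y):
    """One two-variable update of the CSVM dual.

    Moves along the feasible direction u = y_i*e_i - y_j*e_j, which keeps
    the sum constraint  sum_k y_k*alpha_k  unchanged.  Box clipping is
    omitted here; only the unconstrained closed form is shown.
    """
    # gradient components of f(a) = 0.5*a^T Q a - e^T a, with
    # Q_kl = y_k*y_l*x_k^T x_l (linear kernel)
    w = (alpha * y) @ X                       # w = sum_k alpha_k*y_k*x_k
    g_i = y[i] * (w @ X[i]) - 1.0             # df/d alpha_i
    g_j = y[j] * (w @ X[j]) - 1.0             # df/d alpha_j
    # curvature along u:  Q_ii + Q_jj - 2*y_i*y_j*Q_ij = ||x_i - x_j||^2
    eta = np.dot(X[i] - X[j], X[i] - X[j])
    if eta <= 0.0:
        return alpha                          # degenerate pair: skip
    t = -(y[i] * g_i - y[j] * g_j) / eta      # minimizer of the quadratic
    alpha = alpha.copy()
    alpha[i] += y[i] * t
    alpha[j] -= y[j] * t
    return alpha
```

The update changes only two coordinates, so the equality constraint holds by construction after every step.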
Firstly, suppose that the labels of the examples are $y_i = +1$ and $y_j = -1$. The sum constraint (17) on the Lagrange multipliers then reads $\alpha_i - \alpha_j = \zeta$, and the box constraints are $0 \le \alpha_i \le C\gamma$ and $0 \le \alpha_j \le C(1-\gamma)$. Combining the two, we obtain the stricter box constraints on $\alpha_i, \alpha_j$:
$$\max(0, -\zeta) \le \alpha_j \le \min\bigl(C(1-\gamma),\; C\gamma - \zeta\bigr), \qquad \alpha_i = \zeta + \alpha_j.$$
Secondly, suppose $y_i = -1$ and $y_j = +1$; similar stricter box constraints on $\alpha_i, \alpha_j$ are obtained with the roles of $C\gamma$ and $C(1-\gamma)$ exchanged. Thirdly, suppose $y_i = y_j = +1$. Then (17) reads $\alpha_i + \alpha_j = \zeta$, and the stricter box constraints are
$$\max(0, \zeta - C\gamma) \le \alpha_j \le \min(C\gamma, \zeta), \qquad \alpha_i = \zeta - \alpha_j.$$
Finally, suppose $y_i = y_j = -1$; the same form holds with $C(1-\gamma)$ in place of $C\gamma$. For simplification, $\alpha_{t+1,j}$ is first computed as the temporary unconstrained solution and is then edited (clipped) to satisfy its stricter box constraint; $\alpha_{t+1,i}$ is recovered from the linear constraint (17). If the clipped value coincides with the temporary solution, the constrained conditions are not updated.
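The four label cases above collapse into a single clipping routine, since only the upper bounds $C\gamma$ and $C(1-\gamma)$ differ between cases. A sketch (the function name `clip_pair` is illustrative, not from the paper):

```python
def clip_pair(aj_new, ai_old, aj_old, yi, yj, Ci, Cj):
    """Clip the temporary two-variable solution back into its box.

    Ci / Cj are the per-example upper bounds: C*gamma for a positive
    example and C*(1-gamma) for a negative one.  The feasible segment
    [L, H] for alpha_j follows from combining the box constraints with
    the sum constraint, exactly as in the four label cases.
    """
    if yi != yj:                      # alpha_i - alpha_j is fixed
        k = ai_old - aj_old
        L = max(0.0, -k)
        H = min(Cj, Ci - k)
    else:                             # alpha_i + alpha_j is fixed
        s = ai_old + aj_old
        L = max(0.0, s - Ci)
        H = min(Cj, s)
    aj = min(max(aj_new, L), H)       # project alpha_j onto [L, H]
    # recover alpha_i from the (fixed) linear constraint
    ai = k + aj if yi != yj else s - aj
    return ai, aj
```

Clipping $\alpha_j$ first and recovering $\alpha_i$ from the linear constraint guarantees that both box constraints and the equality constraint hold after the update.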
Due to the box constraints of (10), many boundary points of the coordinates ($\alpha_{t,i} = C$ or $\alpha_{t,i} = 0$) appear during the computing process. This yields the first condition.

Condition 1. If the two selected coordinates $\alpha_{t,i}$ and $\alpha_{t,j}$ both take the boundary value 0 or $C$ in an iteration, the analytical solution leaves the two subvariables unchanged, and the coordinates are updated without calculation. The reason is that formula (17) then guarantees $\alpha_{t,i} + \alpha_{t,j} = 0$ or $\alpha_{t,i} + \alpha_{t,j} = 2C$, while under the doubly restricted box constraint $0 \le \alpha_{t,i} \le C$ the result can only be 0 or $C$. The constrained conditions are not updated.
Condition 2. If the projected gradient is 0, the constrained conditions are not updated.

Dual Problem of Cost-Sensitive Support Vector Machine Using Randomized Dual Coordinate Descent Method and Its Complexity Analysis

From the above algorithm derivation, the solving of the CSVM seems to have been successful, but the computational complexity of the solving process can still be large. Assume that the average number of nonzero features per sample is $\bar{s}$. Firstly, the computational complexity of the inner-product matrix $Q_{ij} = y_i y_j x_i^T x_j$, $i, j = 1, 2, \ldots, l$, is $O(l^2 \bar{s})$, but this can be computed in advance and stored in memory. Secondly, the calculation of the gradient component $\nabla_i f(\boldsymbol{\alpha}) = \sum_{j=1}^{l} Q_{ij} \alpha_j - 1$ takes complexity $O(l\bar{s})$. The amount of calculation is very great when the data size is large. However, for the linear CSVM model there is the relation
$$w = \sum_{j=1}^{l} \alpha_j y_j x_j,$$
so $\nabla_i f(\boldsymbol{\alpha})$ is further simplified as
$$\nabla_i f(\boldsymbol{\alpha}) = y_i\, w^T x_i - 1,$$
and solving the corresponding formula (23) is simplified accordingly. As can be seen, computing $\nabla_i f(\boldsymbol{\alpha})$ with complexity $O(l\bar{s})$ becomes computing $y_i w^T x_i$ with complexity $O(\bar{s})$; the number of operations is reduced by a factor of about $l$. However, recomputing $w$ from scratch would still have complexity $O(l\bar{s})$.
The amount of calculation can be reduced significantly when $w$ is updated incrementally by
$$w \leftarrow w + (\alpha_{t+1,i} - \alpha_{t,i})\, y_i x_i + (\alpha_{t+1,j} - \alpha_{t,j})\, y_j x_j,$$
whose computational complexity is only $O(\bar{s})$. So, whether calculating $\nabla_i f(\boldsymbol{\alpha})$ or updating $w$, the per-coordinate gradient computation costs $O(\bar{s})$, which is one of the reasons for the rapid convergence speed of the coordinate gradient method.
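This incremental maintenance of $w$ is the cheap operation at the heart of the method; a minimal sketch:

```python
import numpy as np

def update_w(w, X, y, i, j, d_ai, d_aj):
    """O(s) maintenance of w = sum_k alpha_k*y_k*x_k after a pair update.

    d_ai and d_aj are the changes to alpha_i and alpha_j.  Recomputing w
    from scratch costs O(l*s); adding only the two changed terms costs
    O(s), which is the key saving described in the text.
    """
    return w + d_ai * y[i] * X[i] + d_aj * y[j] * X[j]
```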
The initial point is set to satisfy the constraints: $\alpha_i = 1/l_{+}$ for $y_i = +1$ and $\alpha_i = 1/(l - l_{+})$ for $y_i = -1$, where $l$ and $l_{+}$ are the total number of samples and the number of positive samples, respectively, so that $\sum_i y_i \alpha_i = 0$. Thus the weight vector $w$ of the original problem is obtained by optimizing the Lagrange multipliers $\boldsymbol{\alpha}$ of the dual problem, and formula (45) is used to update $w$. We can see that each inner iteration takes $O(\bar{s})$ effort. The computer memory is mainly used to store the sample information $x_1, x_2, \ldots, x_l$ and the inner products $Q_{ij} = y_i y_j x_i^T x_j$. The cost-sensitive support vector machine using the randomized dual coordinate gradient descent algorithm is described as follows.
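The feasibility of this starting point can be checked directly: with $1/l_{+}$ on each positive example and $1/(l - l_{+})$ on each negative one, the equality constraint sums to $1 - 1 = 0$. A small sketch (helper name `init_alpha` is illustrative):

```python
import numpy as np

def init_alpha(y):
    """Feasible starting point: alpha_i = 1/l_plus for positives and
    alpha_i = 1/(l - l_plus) for negatives, so sum_i y_i*alpha_i = 0."""
    y = np.asarray(y, dtype=float)
    l_plus = int((y > 0).sum())       # number of positive samples
    l_minus = len(y) - l_plus         # number of negative samples
    return np.where(y > 0, 1.0 / l_plus, 1.0 / l_minus)
```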

Description of Algorithm 3. Cost-sensitive support vector machine using randomized dual coordinate descent algorithm (CSVM-RDCD).
For each iteration: select a pair of coordinates, solve the two-variable subproblem in closed form, clip to the box constraints, and update $w$; end for. Repeat until a stopping condition is satisfied.
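Putting the pieces of Algorithm 3 together, the training loop can be sketched as follows. This is a hypothetical minimal version assuming uniformly random pair selection, a linear kernel, and a fixed iteration budget as the stopping condition; the paper's actual selection rule and stopping test may differ:

```python
import numpy as np

def csvm_rdcd(X, y, C=1.0, gamma=0.5, iters=2000, seed=0):
    """Sketch of the CSVM-RDCD loop: randomized two-variable updates of
    the cost-sensitive SVM dual with per-class bounds C*gamma, C*(1-gamma)."""
    rng = np.random.default_rng(seed)
    l = len(y)
    ub = np.where(y > 0, C * gamma, C * (1.0 - gamma))  # per-example bound
    alpha = np.zeros(l)                                 # feasible start
    w = np.zeros(X.shape[1])                            # w = sum a_k*y_k*x_k
    for _ in range(iters):
        i, j = rng.choice(l, size=2, replace=False)
        g_i = y[i] * (w @ X[i]) - 1.0                   # gradient components
        g_j = y[j] * (w @ X[j]) - 1.0
        eta = np.dot(X[i] - X[j], X[i] - X[j])          # curvature ||x_i-x_j||^2
        if eta <= 0.0:
            continue
        t = -(y[i] * g_i - y[j] * g_j) / eta            # closed-form step
        ai, aj = alpha[i] + y[i] * t, alpha[j] - y[j] * t
        # clip alpha_j to its feasible segment, recover alpha_i from (17)
        if y[i] != y[j]:
            k = alpha[i] - alpha[j]
            aj = min(max(aj, max(0.0, -k)), min(ub[j], ub[i] - k))
            ai = k + aj
        else:
            s = alpha[i] + alpha[j]
            aj = min(max(aj, max(0.0, s - ub[i])), min(ub[j], s))
            ai = s - aj
        # O(s) update of w using only the two changed terms
        w += (ai - alpha[i]) * y[i] * X[i] + (aj - alpha[j]) * y[j] * X[j]
        alpha[i], alpha[j] = ai, aj
    return w, alpha
```

Starting from $\alpha = 0$ keeps every iterate feasible, since each pair update preserves the equality constraint and the clipping enforces the box constraints.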
To evaluate the performance of the CSVM-RDCD method, we use a stratified selection to split each dataset into 9/10 training and 1/10 testing. We briefly describe each set below. For each dataset we choose the class with the higher cost or fewer data points as the target or positive class. All multiclass datasets were converted to binary datasets. In particular, the binary datasets SIAM1 and SIAM2 were constructed from the same multiclass dataset but with different target classes and different imbalance ratios. The performance of the three algorithms using CSVM was evaluated by 10-fold cross-validation in terms of the average misclassification cost, the training time, and the recognition rate of the positive class (i.e., the recognition rate of the minority class).
Average misclassification cost (AMC) represents the 10-fold cross-validation average misclassification cost on the test datasets for the related CSVMs, described as follows:
$$\mathrm{AMC} = \frac{\gamma\, n_{+-} + (1-\gamma)\, n_{-+}}{n}.$$
Training time in seconds is used to evaluate the convergence speed of the three algorithms using CSVM on the same computer.
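Using the counts and costs as defined in this paper, the AMC metric can be computed as follows (a small illustrative helper, not the authors' code):

```python
def average_misclassification_cost(y_true, y_pred, gamma):
    """AMC = (gamma*n_pm + (1-gamma)*n_mp) / n, where n_pm counts positive
    examples misclassified as negative and n_mp counts negative examples
    misclassified as positive; gamma is the cost of a false negative."""
    n_pm = sum(1 for t, p in zip(y_true, y_pred) if t > 0 and p < 0)
    n_mp = sum(1 for t, p in zip(y_true, y_pred) if t < 0 and p > 0)
    return (gamma * n_pm + (1.0 - gamma) * n_mp) / len(y_true)
```

With $\gamma$ set from the class ratio, a missed minority-class example contributes more to AMC than a false alarm, which is exactly the imbalance-aware behavior the metric is meant to capture.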
Cost-sensitive parameters of the three algorithms using CSVM are specified in Table 2. The cost-sensitive parameters $\gamma$ and $(1-\gamma)$ are set according to the class ratio of each dataset, namely, the ratio of the minority class to the majority class.
Three datasets with relative class imbalance are examined, namely, the KDD99 (intrusion detection), Webspam, and MNIST datasets. Three datasets with severe class imbalance are examined, namely, the Covertype, SIAM1, and SIAM2 datasets. The average misclassification cost comparison of the three algorithms using CSVM is shown in Figure 1 for each of the datasets. The CSVM-RDCD algorithm outperforms the PCSVM and CSSVM on all datasets.
The recognition rate of the positive class for the three algorithms using CSVM is compared in Figure 2 for each of the datasets. The CSVM-RDCD algorithm outperforms the CSSVM on all datasets, surpasses the PCSVM on four datasets, and ties with the PCSVM on two datasets.
We examine large datasets with relative imbalance ratios and severe imbalance ratios to evaluate the convergence speed of the CSVM-RDCD algorithm. The training time comparison of the three algorithms using CSVM is shown in Figure 3 for each of the datasets. The CSVM-RDCD algorithm outperforms the PCSVM and CSSVM on all datasets.

Experiment on Real-World Big Class-Imbalanced Dataset Classification Problems

In order to verify the effectiveness of the proposed CSVM-RDCD algorithm on real-world big class-imbalanced data classification problems, it was applied to the real-world dataset specified in Table 3. Experimental results show that the randomized dual coordinate descent method is applicable to solving the cost-sensitive SVM dual problem on large-scale experimental datasets. The proposed method achieves superior performance in average misclassification cost, recognition rate of the positive class, and training time, and it keeps the training time short on large-scale datasets. The CSSVM needs to build the full gradient and the kernel matrix $Q$ and needs to select a complex working set during the solving process of the decomposition algorithm. The decomposition algorithm updates the full gradient as a whole, with computational complexity $O(l\bar{s})$ per full gradient update; the PCSVM has similar computational complexity. The randomized dual coordinate gradient method updates the coordinates with complexity $O(\bar{s})$, which considerably increases the convergence speed of the proposed method.

Conclusions
The randomized dual coordinate descent method (RDCD) is an optimization algorithm that updates the global solution using the analytical solution of a subproblem. The RDCD method has a rapid convergence rate, mainly for the following reasons: (1) the subproblem has a formal analytical solution, so the solving process requires no complex numerical optimization; (2) each component is updated on the basis of the previous one, so compared with the CSSVM method, which updates the full gradient information as a whole, the objective function of the RDCD method can decline faster; (3) the single-coordinate gradient calculation of the RDCD method is simpler and cheaper than the full gradient calculation. The randomized dual coordinate descent method is applied here to the cost-sensitive support vector machine, which expands the scope of application of the method. For the large-scale class-imbalanced problem, a cost-sensitive SVM using the randomized dual coordinate descent method is proposed. Experimental results and analysis show the effectiveness and feasibility of the proposed method.

where $n_{+-}$ and $n_{-+}$ represent the number of positive examples misclassified as negative and the number of negative examples misclassified as positive in the test dataset, respectively; $\gamma$ and $(1-\gamma)$ denote the cost of misclassifying a positive example as negative and the cost of misclassifying a negative example as positive, respectively; and $n$ denotes the number of test examples. The recognition rate of the positive class is the ratio of the number of correctly classified positive examples to the number of positive examples in the test dataset.

Figure 1 :
Figure 1: Average misclassification cost comparison of three algorithms using CSVM on 6 datasets.

Figure 2 :
Figure 2: The positive class recognition rate comparison of three algorithms using CSVM.

Figure 3 :
Figure 3: Training time (s) comparison of three algorithms using CSVM.

Table 1 :
Specification of the benchmark datasets.Number of ex. is the number of example data points.Number of feat. is the number of features.Ratio is the class imbalance ratio.Target specifies the target or positive class.Number is the number of the datasets.

Table 2 :
Cost-sensitive parameters of three algorithms using CSVM on 6 datasets.

Table 3 :
Specification of the benchmark datasets and cost-sensitive parameters on the dataset.

The benchmark datasets and cost-sensitive parameters are specified in Table 3. The statistical results of the big class-imbalanced data problems, measuring the quality of the results (average misclassification cost, recognition rate of the positive class, and training time), are listed in Table 4. From Table 4, it can be concluded that CSVM-RDCD consistently achieves superior performance on the big class-imbalanced data classification problems.

Table 4 :
Classification performance of three algorithms using CSVM on the dataset.