Support Vector Machines for Unbalanced Multicategory Classification

Classification is an important research topic with a wide range of applications, because data are now easy to obtain. Among the many classification techniques, the support vector machine (SVM) is widely applied in bioinformatics and genetic analysis, because it has a sound theoretical background and its performance is superior to that of other methods. The SVM can be rewritten as a combination of the hinge loss function and a penalty function. The smoothly clipped absolute deviation penalty function satisfies desirable statistical properties. Since standard SVM techniques typically treat all classes equally, they are not well suited to data with unbalanced class proportions. We propose a robust method that treats unbalanced cases by weighting the classes. A simulation study and a numerical example show that the proposed method is effective for analyzing data with unbalanced proportions.


Introduction
Classification is an important research topic with applications in many fields such as health science and bioinformatics. Many classification methods have been proposed in the literature, for example, linear discriminant analysis, logistic regression, $k$-nearest neighbors, and the support vector machine (SVM), as in [1]. Among them the SVM is popular in engineering, because it has a sound theoretical background. It has been widely applied in bioinformatics and genetic analysis, and its performance is superior to that of other methods.
High dimensional data are now easy to obtain thanks to computer technology. When the number of predictors is very large, variable selection is crucial for obtaining a meaningful model. A model with one thousand nonzero predictors is hard to interpret and gives little insight into the data. Furthermore, if the true model is sparse, the fitted model should be sparse. Variable selection is an important research topic in linear regression modeling, especially when the number of predictors is very large (see [2]). Methods that perform variable selection and estimation simultaneously have therefore been suggested; they are called penalized methods. The SVM can be regarded as a penalized method consisting of the hinge loss function and the $L_2$ penalty function.
Many penalty functions have been proposed since the Ridge regression estimator. Among them the least absolute shrinkage and selection operator (LASSO), proposed by [3] with the $L_1$ penalty function, is very popular because it yields sparse solutions. However, the LASSO estimate can be biased for coefficients with large absolute values, as pointed out by [4]. They proposed a nonconvex penalty function, the smoothly clipped absolute deviation (SCAD) penalty, which satisfies three desirable properties: unbiasedness, sparseness, and continuity. Since the SCAD function is not convex, standard optimization algorithms cannot be applied. The locally quadratic approximation (LQA) algorithm can be adopted instead to optimize the objective function, as in [5].
The traditional SVM treats the observations of each class equally. When the class sizes are unequal, especially when one class has far fewer observations than the others, the SVM tends to disregard the classes with the fewest observations. By ignoring the minority classes the traditional SVM can decrease the overall misclassification rate. However, the minority classes often carry important characteristics of the data. For example, among hospital patients the class of cancer patients is much smaller than the other classes. Nevertheless, the cancer class carries significant information on the death rate and so it should not be ignored.
In this paper we develop SVM algorithms that treat unbalanced cases using a weighted loss function and the SCAD penalty function. Because the weights increase the impact of the minority classes, the resulting classifier does not ignore them, whereas the traditional SVM does. We use the local linear approximation (LLA) of the SCAD function to overcome the deficiencies of the LQA algorithm.
This paper is organized as follows. In Section 2 we review the SVM for unbalanced cases and its statistical properties. We consider two classifiers, one based on the overall misclassification rate and one based on the sum of the within group error rates. The latter does not ignore the minor classes and is more applicable to unbalanced cases. The $L_1$ and the SCAD penalty functions are briefly reviewed. Section 3 gives an algorithm to implement the proposed method. Reference [6] proposed an LQA for the SVM with the SCAD function. We use an LLA algorithm, which can be written as a linear programming problem, to minimize the nondifferentiable and nonconvex objective function. Section 4 provides the simulation results and a numerical example. They show that the proposed method outperforms the traditional SVM with respect to the minority classes. Section 5 gives some discussion and concluding remarks.

SVM for Unbalanced Cases
For a multiclass classification problem, a training sample $\{(\mathbf{x}_i, y_i)\}$, $i = 1, \ldots, n$, is given, where the input vector $\mathbf{x}_i \in \mathbb{R}^p$ and the output label $y_i \in \{1, \ldots, K\}$, $n$ is the number of observations, $p$ is the dimension of the input vector, and $K$ is the number of classes. Suppose that the samples are drawn from an unknown joint probability distribution $P(\mathbf{x}, y)$.
The classifier $\phi : \mathbb{R}^p \to \{1, \ldots, K\}$ trained on the training sample predicts the class of a future input vector $\mathbf{x}$. The standard classification criterion assigns the same misclassification cost to every class. The loss function in this case is the 0-1 loss function, $L(y, \phi(\mathbf{x})) = I(y \neq \phi(\mathbf{x}))$, and the empirical version of the corresponding risk function is $\sum_{i=1}^{n} I(y_i \neq \phi(\mathbf{x}_i))/n$. Here $I(\cdot)$ denotes the indicator function. However, when the output label has minority classes, a classifier based on the overall misclassification rate gives little information on the minority classes, which may carry very important characteristics of the data. For example, cancer patients are a minor class among the general patients visiting a hospital, yet they deserve special attention from the hospital. Samples with unbalanced proportions are often found in real world data.
The classical criterion finds a decision rule minimizing the overall misclassification rate:
$$P(\phi(\mathbf{X}) \neq Y) = \sum_{k=1}^{K} \pi_k \, P(\phi(\mathbf{X}) \neq Y \mid Y = k), \tag{1}$$
where $(\mathbf{X}, Y)$ is a $(p + 1)$-dimensional random vector from a probability distribution $P(\mathbf{X}, Y)$ and $\pi_k$ is the proportion of class $k$ in the population. If $\pi_k$ is very small, the overall misclassification rate (1) can be very small, because the misclassification rate for the $k$th class hardly influences the quantity. Even if the misclassification rate for the $k$th class is very high, it can be ignored. Thus if the proportion term $\pi_k$ in (1) is deleted, a minor class with very small $\pi_k$ can influence the classifier. We consider a classifier which minimizes
$$\sum_{k=1}^{K} P(\phi(\mathbf{X}) \neq Y \mid Y = k), \tag{2}$$
obtained by discarding the term $\pi_k$ in (1). Denote $e_k = E[I(\phi(\mathbf{X}) \neq Y) \mid Y = k]$, $k = 1, \ldots, K$, as the within group error for the $k$th class. The empirical version of (2) is the sum over the classes of the ratio of the number of misclassifications in class $k$ to the number of observations in class $k$, that is, the sum of the within group error rates (see [7]). Criterion (2) is called the within group error rate criterion.
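The difference between the two criteria can be illustrated with a small Python sketch. It computes the empirical overall misclassification rate and the within group error rate of each class; the function name `error_rates` and the toy labels are illustrative assumptions, not part of the original study.

```python
from collections import Counter

def error_rates(y_true, y_pred):
    """Return (overall misclassification rate, {class: within group error rate})."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    class_sizes = Counter(y_true)
    # Count the misclassified observations separately for each true class.
    class_errors = Counter(t for t, p in zip(y_true, y_pred) if t != p)
    overall = sum(class_errors.values()) / len(y_true)
    within = {k: class_errors[k] / n_k for k, n_k in class_sizes.items()}
    return overall, within

# Toy data: class 1 is the minority (10 of 100) and is always predicted as class 2.
y_true = [1] * 10 + [2] * 45 + [3] * 45
y_pred = [2] * 10 + [2] * 45 + [3] * 45
overall, within = error_rates(y_true, y_pred)
sum_within = sum(within.values())
```

In this toy case the overall rate stays low even though the minority class is never detected, whereas the sum of the within group error rates exposes the failure.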
Research on this topic has focused on methods at the data level and at the algorithmic level, namely, resampling methods for balancing the dataset and modifications of existing learning algorithms. Undersampling and oversampling are resampling techniques. Reference [7] proposed an adaptive weighted learning method using the SVM.
The SVM is a classification method based on a large margin. In the separable case it finds the hyperplane maximizing the separation distance between the two classes. In the nonseparable case the SVM uses slack variables, the so-called soft margin method, to maximize the separation distance while allowing mislabeled samples. There are several extensions from the binary SVM to the multiclass SVM such as one-versus-one and one-versus-rest (see [8]). However, these methods perform poorly when the data are dominated by a single class. Reference [9] proposed a simultaneous multiclass SVM. Reference [5] suggested a simultaneous SVM algorithm with the SCAD penalty function.
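The multiclass SVM minimizes the objective function (see [9])
$$\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} I(y_i \neq k) \, [f_k(\mathbf{x}_i) + 1]_+ + \lambda \sum_{k=1}^{K} \sum_{j=1}^{p} w_{kj}^2, \tag{3}$$
where $\mathbf{f}(\mathbf{x}) = (f_1(\mathbf{x}), \ldots, f_K(\mathbf{x}))$ with a sum-to-zero constraint $\sum_{k=1}^{K} f_k(\mathbf{x}) = 0$, $[z]_+ = \max(z, 0)$, and the parameter $\lambda$ controls the trade-off between the training error and the model complexity. Equation (3) is limited to the linear SVM, where $f_k(\mathbf{x}) = b_k + \sum_{j=1}^{p} w_{kj} x_j$. The SVM objective function consists of a loss function and a penalty function, as does the objective function in penalized linear regression (see [10]).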
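As a rough illustration of the loss part of this objective, the following Python sketch evaluates a multicategory hinge loss for the linear decision functions $f_k(\mathbf{x}) = b_k + \mathbf{w}_k^T \mathbf{x}$. It assumes the loss form of [9], in which only the wrong classes $k \neq y_i$ contribute $[f_k(\mathbf{x}_i) + 1]_+$ and the contributions are averaged over the observations. The function name, the 0-based class labels, and the toy coefficients are illustrative assumptions.

```python
def multiclass_hinge_loss(X, y, W, b):
    """Average over observations of sum_{k != y_i} max(f_k(x_i) + 1, 0),
    with f_k(x) = b[k] + dot(W[k], x). Labels are assumed to be 0, ..., K-1."""
    n, K = len(X), len(W)
    total = 0.0
    for x_i, y_i in zip(X, y):
        for k in range(K):
            if k == y_i:
                continue  # the true class does not contribute to the loss
            f_k = b[k] + sum(w * x for w, x in zip(W[k], x_i))
            total += max(f_k + 1.0, 0.0)
    return total / n

# Toy coefficients that satisfy the sum-to-zero constraint over the classes.
W = [[1.0, 0.0], [-1.0, 0.0], [0.0, 0.0]]
b = [0.0, 0.0, 0.0]
x = [[1.0, 0.0]]  # decision values: f_0 = 1, f_1 = -1, f_2 = 0

loss_if_class_0 = multiclass_hinge_loss(x, [0], W, b)  # label agrees with the largest f_k
loss_if_class_1 = multiclass_hinge_loss(x, [1], W, b)  # label has the smallest f_k
```

The loss is smaller when the labelled class is the one with the largest decision value, which is what the minimization encourages.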
For a sparse model the $L_1$ SVM, as in [11], identifies valuable variables by discarding redundant noise input variables, like the LASSO of [3], which is popular in penalized linear regression models. Reference [12] proposed a multiclass SVM which performs classification and variable selection simultaneously through an $L_1$-norm penalized sparse representation. However, the $L_1$ solution is biased for large coefficients. Reference [4] proposed a nonconvex penalty function, the SCAD penalty function, together with desirable properties for penalty functions: unbiasedness of the estimator, sparseness of the model, and continuity of the estimator in the tuning parameter $\lambda$. The $L_2$ penalty satisfies neither unbiasedness nor sparsity, and the $L_1$ penalty does not satisfy unbiasedness.
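The SCAD function can be written as
$$p_\lambda(|w|) = \begin{cases} \lambda |w|, & |w| \le \lambda, \\ -\dfrac{|w|^2 - 2a\lambda |w| + \lambda^2}{2(a - 1)}, & \lambda < |w| \le a\lambda, \\ \dfrac{(a + 1)\lambda^2}{2}, & |w| > a\lambda, \end{cases} \tag{4}$$
where $a > 2$ and $\lambda > 0$. Reference [4] recommended $a = 3.7$ on the basis of simulation results. Since the derivative of the SCAD function is zero outside the range $[-a\lambda, a\lambda]$, the SCAD SVM estimates are unbiased for large coefficients.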
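A short Python sketch may help to see how the SCAD penalty behaves relative to the $L_1$ penalty. It uses the standard three-piece SCAD form of [4] with the recommended $a = 3.7$; the function names and the value $\lambda = 0.5$ are illustrative assumptions.

```python
def scad_penalty(w, lam, a=3.7):
    """SCAD penalty of Fan and Li: linear near zero, quadratic in between, then constant."""
    t = abs(w)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return -(t * t - 2.0 * a * lam * t + lam * lam) / (2.0 * (a - 1.0))
    return (a + 1.0) * lam * lam / 2.0

def l1_penalty(w, lam):
    """LASSO-type penalty, which keeps growing with |w|."""
    return lam * abs(w)

lam = 0.5
small, large, huge = 0.25, 4.0, 100.0
scad_values = [scad_penalty(w, lam) for w in (small, large, huge)]
l1_values = [l1_penalty(w, lam) for w in (small, large, huge)]
```

For small coefficients the two penalties coincide, while for coefficients beyond $a\lambda$ the SCAD penalty stays constant and therefore does not shrink large coefficients further.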
Because the SCAD function is singular at zero, the SCAD SVM provides a sparse model. Like the standard SVM (3), the SCAD SVM minimizes the objective function
$$\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} I(y_i \neq k) \, [f_k(\mathbf{x}_i) + 1]_+ + \sum_{k=1}^{K} \sum_{j=1}^{p} p_\lambda(|w_{kj}|). \tag{5}$$
In the linear SVM case, $f_k(\mathbf{x}_i) = b_k + \mathbf{w}_k^T \mathbf{x}_i$ gives the objective function of the SCAD SVM:
$$\min_{\mathbf{b}, \mathbf{w}} \; \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} I(y_i \neq k) \, [b_k + \mathbf{w}_k^T \mathbf{x}_i + 1]_+ + \sum_{k=1}^{K} \sum_{j=1}^{p} p_\lambda(|w_{kj}|) \tag{6}$$
subject to
$$\sum_{k=1}^{K} b_k = 0, \qquad \sum_{k=1}^{K} w_{kj} = 0, \quad j = 1, \ldots, p. \tag{7}$$
The classical classification rule (1) and the classifier based on the within group error rate (2) then take the forms (8) and (9), respectively. If one class dominates the population, classifying every input vector as the dominating class already attains a low misclassification rate. However, this does not yield a meaningful classifier. For data with unbalanced proportions (6) cannot detect the minority class. The equally weighted hinge loss in (6) can ignore the minority class, so a hinge loss weighted by class can handle the unbalanced case, as in [7].
Instead of the unweighted SVM objective function (6) we now consider a weighted SVM objective function with the SCAD penalty function for unbalanced cases:
$$\min_{\mathbf{b}, \mathbf{w}} \; \frac{1}{n} \sum_{i=1}^{n} v_i \sum_{k=1}^{K} I(y_i \neq k) \, [b_k + \mathbf{w}_k^T \mathbf{x}_i + 1]_+ + \sum_{k=1}^{K} \sum_{j=1}^{p} p_\lambda(|w_{kj}|) \tag{10}$$
subject to
$$\sum_{k=1}^{K} b_k = 0, \qquad \sum_{k=1}^{K} w_{kj} = 0, \quad j = 1, \ldots, p, \tag{11}$$
where $v_i$ is the weight for the $i$th observation, which belongs to the $k$th class. A weighted SVM was proposed in [13] to make the SVM robust, that is, insensitive to outliers and leverage points.
We consider the class weight (12), where $\hat{\pi}_k$ is the estimate of the population proportion of the $k$th class, as in [7]. It is called a proportional weight because it is determined by the number of observations in each class. Weight (12) considers only the unbalanced proportion. When the dataset has outlying observations, the SVM based on weight (12) may not recover the underlying decision function $f_k(\mathbf{x})$. Thus we propose the weight (13) for the data, where $\hat{e}_k$ denotes the within group error of the $k$th class on the training dataset with equal weights. We call this weight an adaptive weight. The within group error is calculated as the misclassification rate for the $k$th class.
Weight (13) puts much more weight on the minority class, and a well-classified group receives less weight. Larger values of $|f_k(\mathbf{x}_i)|$ in (13) represent well-classified observations, which therefore receive small weights. Small values of $|f_k(\mathbf{x}_i)|$ suggest that the corresponding observations are misclassified. Their weights therefore become larger, and the learning machine retains the observations carrying important information. The proposed weight (13) accounts for both robustness and the unbalanced proportion of the data, because its factor involving $1/\hat{\pi}_k$ and $\hat{e}_k$ reflects the unbalanced proportion and its factor $1/(1 + |f_k(\mathbf{x}_i)|)$ provides resistance to outlying observations. Weight (12), in contrast, considers only the unbalanced proportion of the classes.
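To make the idea of a proportional class weight concrete, here is a minimal Python sketch. It assumes weights proportional to $1/\hat{\pi}_k$ and normalized to sum to one, so that balanced classes all receive the weight $1/K$. This normalization, the function name `inverse_proportion_weights`, and the class counts are illustrative assumptions and may differ from the exact form of weight (12).

```python
from collections import Counter

def inverse_proportion_weights(labels):
    """Class weights proportional to 1 / pi_hat_k, normalized to sum to one (assumed form)."""
    counts = Counter(labels)
    n = len(labels)
    inverse = {k: n / c for k, c in counts.items()}  # 1 / pi_hat_k with pi_hat_k = c / n
    total = sum(inverse.values())
    return {k: v / total for k, v in inverse.items()}

# Hypothetical unbalanced training labels: class 1 dominates, classes 2 and 3 are minorities.
labels = [1] * 50 + [2] * 12 + [3] * 10
weights = inverse_proportion_weights(labels)

# With balanced classes every class receives the same weight.
balanced = inverse_proportion_weights([1, 2, 3] * 20)
```

Under this assumption the smallest class receives the largest weight, so its hinge loss terms count more in the weighted objective.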

Algorithm
Since the SCAD function is not convex, the objective function in (10) is not convex. Thus standard optimization techniques cannot be adopted. Nonconvex penalties usually give good statistical properties, but their implementation is not easy.
Reference [4] used the LQA algorithm, which has drawbacks similar to those of backward elimination. That is, for numerical stability a variable whose coefficient is close to zero is deleted and is not included in the final model. Furthermore, the solution of the LQA algorithm has a Ridge-type form and, like the Ridge regression estimator, it does not guarantee the sparseness of the solution. Reference [14] proposed a perturbed version of the LQA algorithm that renders the objective function differentiable and then optimizes this differentiable function using a minorize-maximize algorithm. However, it is not easy to select the size of the perturbation. Reference [15] proposed a one-step sparse estimation procedure for nonconcave penalized likelihood models, which is called the LLA algorithm. It has neither the drawbacks of the LQA algorithm nor the numerical instability at zero.
By the Taylor expansion of the SCAD function we obtain the approximation
$$p_\lambda(|w_{kj}|) \approx p_\lambda(|w_{kj}^0|) + p'_\lambda(|w_{kj}^0|) \, (|w_{kj}| - |w_{kj}^0|), \tag{14}$$
where the first derivative of the SCAD function is
$$p'_\lambda(w) = \lambda \left\{ I(w \le \lambda) + \frac{(a\lambda - w)_+}{(a - 1)\lambda} \, I(w > \lambda) \right\}, \quad w > 0. \tag{15}$$
By putting (14) into (10), we obtain the objective function up to constants
$$\frac{1}{n} \sum_{i=1}^{n} v_i \sum_{k=1}^{K} I(y_i \neq k) \, [b_k + \mathbf{w}_k^T \mathbf{x}_i + 1]_+ + \sum_{k=1}^{K} \sum_{j=1}^{p} p'_\lambda(|w_{kj}^0|) \, |w_{kj}|, \tag{16}$$
where $w_{kj}^0$ is an initial solution near the true value $w_{kj}$ and the restriction of (10) is still effective.
We introduce the variable $\xi_{ik} = v_i I(y_i \neq k)(\mathbf{w}_k^T \mathbf{x}_i + b_k + 1)_+$ with adjusted subindex and use the facts that $|w| = w^+ + w^-$ and $w = w^+ - w^-$, where $w^+ = \max(w, 0)$ and $w^-$ is defined similarly. Then the weighted SVM objective function (10) becomes the linear programming problem (17), whose constraints (18) include the nonnegativity of the positive and negative parts, for $i = 1, \ldots, n$; $k = 1, \ldots, K$; $j = 1, \ldots, p$. The first derivative of the SCAD function (4) is evaluated at the initial value $w_{kj}^0$. Equation (17) can be minimized by standard optimization packages. We obtain the optimum solution $w_{kj}^+$, $w_{kj}^-$, $b_k^+$, $b_k^-$, and the parameters are then estimated by $w_{kj} = w_{kj}^+ - w_{kj}^-$ and $b_k = b_k^+ - b_k^-$. Equation (17) can be solved by lpSolve in the R program.
The linear programming problem can be formulated as (19), where $\mathbf{z}$ is an $m$-vector of variables with $m = (n + 2p + 1)K - n$, composed of the variables $\xi_{ik}$, $w_{kj}^+$, $w_{kj}^-$, $b_k^+$, $b_k^-$. The constraint matrix $\mathbf{A}_{n(K-1) \times m}$, the vector $\mathbf{f}_{n(K-1) \times 1}$, and the coefficient vector of the objective function are then set from the linear programming problem in (17).
Since the SCAD function is approximated by a weighted absolute value function, the method is similar to the $L_1$ penalty SVM. From the viewpoint of variable selection it is well known that the $L_1$ SVM gives very useful information for classification when the number of predictors greatly exceeds the number of observations (see [12]). Thus the proposed method is effective in variable selection, especially when the number of variables is large.
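The local linear approximation can be sketched in a few lines of Python. The sketch linearizes the standard SCAD penalty of [4] around an initial value $|w^0|$ and shows that each coefficient then receives its own $L_1$ weight $p'_\lambda(|w^0|)$. The function names, the value $\lambda = 1$, and the initial values are illustrative assumptions.

```python
def scad(t, lam, a=3.7):
    """SCAD penalty evaluated at t = |w| >= 0."""
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return -(t * t - 2.0 * a * lam * t + lam * lam) / (2.0 * (a - 1.0))
    return (a + 1.0) * lam * lam / 2.0

def scad_deriv(t, lam, a=3.7):
    """Derivative of the SCAD penalty at t = |w| (right derivative at t = 0)."""
    if t <= lam:
        return lam
    return max(a * lam - t, 0.0) / (a - 1.0)

def lla(t, t0, lam, a=3.7):
    """First-order (local linear) approximation of the SCAD penalty around t0 = |w0|."""
    return scad(t0, lam, a) + scad_deriv(t0, lam, a) * (t - t0)

lam = 1.0
initial = [0.0, 0.5, 2.0, 5.0]  # hypothetical |w0| values from an initial fit
l1_weights = [scad_deriv(t0, lam) for t0 in initial]
```

Coefficients whose initial value exceeds $a\lambda$ get a zero weight and are not penalized in the next step, while coefficients near zero keep the full weight $\lambda$, which is why the resulting problem is a weighted $L_1$ problem that fits a linear program.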
We summarize the proposed algorithm for the weighted multiclass SVM with the SCAD penalty function by the following steps.
(1) Tuning process is as follows.
(3) Test process is as follows.
(3.1) Calculate the misclassification rates for the test data based on the tuning parameter $\lambda$ and the parameters $\mathbf{w}_k$ and $b_k$.

Numerical Experiment
In this section, we compare the performance of the proposed algorithm with that of the classical multiclass SVM on data with unbalanced proportions. We conducted experiments on simulated data and on a real dataset.
Our simulation data consist of training data, tuning data, and independent test data. The coefficients $\mathbf{w}_k$, $b_k$ are estimated from the training data, and the performance is evaluated on test data of sizes 10,000 and 18,000 generated from the same distribution described above. Tuning data of sizes 100 and 180 are used to determine the tuning parameter $\lambda$. We conducted simulation iterations to evaluate the within group error for each class. All simulations were carried out using the R program.
Tables 1 and 2 summarize the within group error rates for equal weights, proportional weights (12), and adaptive weights (13), the last of which combine the class proportions, the misclassification rates, and robustness to outliers. The values in the tables are evaluated on the test data. We consider two classifiers, the classical classifier (CLSC) of (1) and the minimum within group error (MWGE) classifier of (2). We compared the proposed algorithm with the equal weight SCAD SVM, which is the classical SCAD SVM. Tables 1 and 2 show that the equal weight scheme ignores the minor class 1 in order to minimize the overall misclassification rate. In contrast, the proposed algorithms (proportional weights and adaptive weights) do not ignore the minor class 1, which may be the class of main interest. For the equal weight scheme the misclassification rate for class 1 is close to 1, but the overall misclassification rate is not high, because class 1 has a small proportion. In particular, in the 1 : 4 : 4 case the equal weight scheme did not discriminate class 1 at all. This is undesirable when class 1 itself is of main interest. As the degree of imbalance increases, the weights in the proposed algorithms become more effective.
The MWGE method becomes more powerful than the CLSC method as the imbalance increases, with the proportions changing from 1 : 2 : 2 to 1 : 4 : 4. The two weight schemes give similar within group error rates. Tables 1 and 2 indicate that the proposed weighted SVM helps to reduce the effect of the unbalanced proportions of the data.
The final weights for the classes are reported in Table 3. The adaptive weight scheme puts larger weights on the minor class than the equal weight or the proportional weight scheme. This explains why the adaptive weight scheme has a small within group error for the minor class 1 in Tables 1 and 2.
We conducted 100 replications of the thyroid experiment and summarized the results in Tables 4 and 5. The test errors for classes 2 and 3 (the patient classes) under the equal weight scheme are higher than that for class 1, because class 1 (the normal class) dominates the data. The proportional weight and adaptive weight algorithms discriminate the patients from the normal group better, while the correct classification rate of the normal group becomes lower. In Table 5 the weights for class 1 in the proportional and adaptive weight schemes are low, which increases the within group error rate for class 1, whereas the weights for classes 2 and 3 are high. In the classical weight scheme, however, the weight for every class is 1/3, so it does not take the unbalanced proportion of the data into account.
We summarize the overall misclassification rate in Table 6. It shows that the classical equal weight scheme with the MWGE classifier performs best in terms of the overall misclassification rate. However, the within group error rates of classes 2 and 3 for this scheme are higher than those of the proportional weight scheme. The weighted SVM should therefore be considered for the classification of thyroid patients.
Suppose that a thyroid patient is classified as a normal person. The patient then misses the opportunity to receive medical treatment, which may be dangerous. This example shows that the minor classes should be given due importance. We recommend using both the classical equal weight scheme and the weighted SVM.

Conclusion
In this paper we proposed the weighted SCAD SVM for unbalanced multicategory classification problems. When a class has a small proportion among all classes, the numerical simulation and the data analysis show that the proposed method is worth considering, especially with respect to the within group error rates of the minor classes. In future work we will study methods that consider both the overall misclassification rate and the within group error rate.

4.2. Real Dataset
In this section, we apply the proposed method to the thyroid gland database from the UCI Machine Learning website http://www.ics.uci.edu/~mlearn/. The data are used to predict whether a patient's thyroid belongs to the class euthyroidism, hypothyroidism, or hyperthyroidism. The diagnosis (the class label) was based on 5 continuous medical records such as anamnesis and scan. The three classes are normal (class 1), hyperthyroidism (class 2), and hypothyroidism (class 3). The data have 215 observations: 150 in class 1, 35 in class 2, and 30 in class 3. The data are unbalanced because nearly 70 percent of the patients are normal (class 1); classes 2 and 3 are minority classes. To evaluate the performance of the proposed algorithm we randomly divide the data into three sets: a training set with 72 observations, a tuning set with 72 observations, and a test set with 71 observations. The training and tuning sets have the same class proportion 25 : 6 : 5. We obtained the tuning parameter $\lambda$ of the SVM from the tuning data and the estimates of the model parameters $\mathbf{w}_k$, $b_k$ of the SVM from the training data. The within group error rates of the SVMs are evaluated on the test data.
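The split described here can be sketched in Python as follows. With 72 observations and the proportion 25 : 6 : 5, the training and tuning sets each contain 50, 12, and 10 observations of classes 1, 2, and 3, and the remaining 50, 11, and 10 observations form the test set. The function name, the fixed seed, and the use of index lists in place of the actual thyroid records are illustrative assumptions.

```python
import random

def stratified_three_way_split(labels, n_train, n_tune, seed=0):
    """Split indices into train/tune/test with fixed per-class counts for train and tune."""
    rng = random.Random(seed)
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    train, tune, test = [], [], []
    for lab, idxs in by_class.items():
        if n_train[lab] + n_tune[lab] > len(idxs):
            raise ValueError(f"class {lab} has too few observations")
        rng.shuffle(idxs)  # random assignment within each class
        train += idxs[:n_train[lab]]
        tune += idxs[n_train[lab]:n_train[lab] + n_tune[lab]]
        test += idxs[n_train[lab] + n_tune[lab]:]  # the remainder goes to the test set
    return train, tune, test

# Class sizes of the thyroid data: 150 normal, 35 hyperthyroidism, 30 hypothyroidism.
labels = [1] * 150 + [2] * 35 + [3] * 30
per_class = {1: 50, 2: 12, 3: 10}  # 72 observations in the proportion 25 : 6 : 5
train, tune, test = stratified_three_way_split(labels, per_class, per_class)
```

Repeating the split with different seeds gives independent replications of the experiment.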

Table 1: The within group error rates of the weighted SCAD SVMs in the unbalanced case with proportion 1 : 2 : 2.

Table 2: The within group error rates of the weighted SCAD SVMs in the unbalanced case with proportion 1 : 4 : 4.

Table 3: The average final chosen weights for the weighted SCAD SVMs on the simulation data.

Table 4: The within group error rates of the weighted SCAD SVMs on the new thyroid data.

Table 5: The average final chosen weights for the weighted SCAD SVMs on the new thyroid data.

Table 6: The overall misclassification rate of the weighted SCAD SVMs on the new thyroid data.