A Structural SVM Based Approach for Binary Classification under Class Imbalance

Class imbalance, where one class is rare compared to the other, arises frequently in machine learning applications. It is well known that the usual misclassification error is not a suitable measure in such settings. A wide range of performance measures, such as AM and QM, have been proposed for this problem. However, due to computational difficulties, few learning techniques have been developed to directly optimize the AM or QM metric. To fill this gap, we present a general structural SVM framework for directly optimizing AM and QM. We define loss functions oriented to AM and QM, respectively, and adopt the cutting plane algorithm to solve the outer optimization. For the inner problem of finding the most violated constraint, we propose two efficient algorithms, one for the AM problem and one for the QM problem. Empirical studies on various imbalanced datasets demonstrate the effectiveness of the proposed approach.


Introduction
Classification under class imbalance, where one class is rare compared to the other, is a common and important problem in supervised learning. It arises in many applications, ranging from medical diagnosis and text retrieval to credit risk prediction and fraud detection [1][2][3][4]. Due to its practical importance, it has been identified as one of the ten most challenging problems in data mining research [5]. For simplicity and without loss of generality, only binary classification problems under class imbalance are considered in this paper. However, it is important to keep in mind that the class imbalance problem is pervasive in other areas as well, such as multiclass classification and association rule mining.
It is well known that the usual binary learning algorithms are ill-suited to imbalanced domains, because those classifiers are biased towards the majority class and therefore have lower sensitivity in detecting minority class examples [6]. In the literature on class imbalance, a variety of approaches have been proposed, which can be mainly categorized into two groups: data-oriented methods and algorithm-oriented methods.
The data-oriented methods use various sampling techniques to oversample instances in the minority class [6][7][8] or undersample those in the majority class [9,10], so that the resulting data is balanced. A typical example is the SMOTE approach [6], which increases the number of minority class instances by creating synthetic samples.
The second group, the algorithm-oriented methods, aims at extending and modifying existing classification algorithms so that they can deal with imbalanced data more effectively. For example, Liu et al. and Kang and Ramamohanarao presented two modified decision tree algorithms that improve the standard C4.5, namely CCPDT [11] and HeDEx [12], while Köknar-Tezel et al., Joachims et al., and Lipton et al. proposed various approaches to improve the performance of the traditional SVM in imbalanced settings [13][14][15][16][17][18][19][20][21][22].
Mathematical Problems in Engineering

Both groups are effective, and it is difficult to say which one is better. However, since our goal in this paper is to improve an existing statistical learning algorithm, we focus on the algorithm-oriented approach and propose a modified SVM that directly optimizes imbalance measures. Our algorithm may appear similar to the algorithms in [15][16][17][18][19][20][21][22]; however, we design different objective functions and use different optimization techniques. More specifically, this paper makes the following contributions.
(1) We adopt 1-slack structural SVM as the framework and define the loss functions oriented to AM and QM, which are rarely considered in the literature of optimizing imbalance metrics.
(2) We show that the QM loss is a lower bound of the AM one, which suggests that our QM classifier may be more accurate than the AM one.
(3) For the inner computational challenge of the AM loss, we decompose the problem and apply the Hardy-Littlewood-Polya inequality to solve it in $O(n \log n)$ time. For the QM loss, such a decomposition is impossible, so we present an efficient greedy method for solving the problem, which also requires $O(n \log n)$ time.
(4) Empirical evaluations on the imbalanced datasets demonstrate that the proposed algorithms are not only significantly better than the standard binary learning algorithm but also competitive with other existing algorithms for imbalanced data.
The remainder of the paper is organized as follows. Section 2 presents the related work. Section 3 discusses the details of our proposed algorithms, and the empirical results on the benchmark datasets are reported in Section 4. Section 5 concludes the paper and discusses future work.

Consider a training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$ is the $i$th example and $y_i \in \{+1, -1\}$ is the corresponding class label. The binary classification problem is to construct a classifier function $f(x)$ that gives good generalization performance. We assume that the classifier function is of the form $f(x) = w \cdot x$ and that the decision function $y = \operatorname{sign}(f(x))$ is used to find the label of an unseen example. Note that we have not included the bias term in the classifier function for notational convenience. However, it can be incorporated in a straightforward way.

Related Work
In machine learning, a common way to find the linear parameter $w \in \mathbb{R}^d$ is to minimize a regularized risk function:

$$\min_{w} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} L(x_i, y_i, w),$$

where $C > 0$ is a constant that controls the trade-off between training error minimization and margin maximization, and $L$ is a suitable loss function that measures the discrepancy between the true label $y_i$ and the value predicted using $w$. Different loss functions yield different learners. One of the most famous loss functions is the hinge loss used in SVM, which has the form $L(x_i, y_i, w) = \max(0, 1 - y_i\,(w \cdot x_i))$.
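As a minimal sketch of the regularized risk with the hinge loss, assuming the common form "half the squared norm of w plus C times the summed losses", the objective can be evaluated as follows. The function names and the two toy examples are illustrative assumptions, not part of the original method.

```python
def dot(w, x):
    """Inner product w . x for two equal-length sequences."""
    return sum(wj * xj for wj, xj in zip(w, x))


def hinge_loss(x, y, w):
    """Hinge loss max(0, 1 - y * (w . x)) for one example with y in {+1, -1}."""
    return max(0.0, 1.0 - y * dot(w, x))


def regularized_risk(w, examples, C):
    """0.5 * ||w||^2 + C * (sum of hinge losses over the (x, y) pairs)."""
    margin_term = 0.5 * dot(w, w)
    loss_term = sum(hinge_loss(x, y, w) for x, y in examples)
    return margin_term + C * loss_term


# Toy data (assumed for illustration): one positive and one negative example.
examples = [((2.0, 0.0), +1), ((0.0, 1.0), -1)]
w = (1.0, 0.0)
risk = regularized_risk(w, examples, C=1.0)
```

The first example lies beyond the margin and contributes no loss, while the second lies on the decision boundary and contributes a loss of 1, so increasing `C` penalizes this `w` more heavily.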

Relevant Background.
Standard SVM optimizes an estimate of the classification error on the training set and has been shown to be a very powerful tool for classification problems when the data is balanced. However, if the data is highly imbalanced, classification error is not always a good measure, and the standard SVM can be misleading. To solve this problem, a number of modified algorithms have been proposed. For example, Köknar-Tezel and Latecki [13] and Shao et al. [14] proposed approaches to improve SVM on imbalanced datasets, which they called GP and WLTSVM, respectively. However, both works focus on improving sampling techniques (e.g., modifying SMOTE in GP) for SVM and do not address the training bias in the design of the SVM learning algorithm per se. Recently, with the advances in learning to rank, the technique of directly optimizing a ranking measure has been extended to the design of SVMs for the imbalanced setting, and a variety of algorithm-oriented methods have been proposed. Joachims [15] and Aiolli [16] presented algorithms that optimize AUC for imbalanced data, and the experimental results on imbalanced sets demonstrated their effectiveness. Along the same lines, Paisitkriangkrai and Narasimhan et al. further gave algorithms that optimize partial AUC and successfully applied their approaches to real-world tasks [17][18][19]. Optimizing the F-measure is another popular method for imbalanced learning. Joachims [15], Chinta et al. [20], Maratea et al. [21], and Lipton et al. [22] used different approximations to the F-measure and designed different classifiers. Numerical experiments on the benchmark datasets demonstrated the effectiveness of their algorithms. However, it is well known that, for evaluation in the imbalanced setting, there are many other performance measures besides AUC and the F-measure, including AM (arithmetic mean) [23] and QM (quadratic mean) [24]. The AM is the arithmetic mean of the true positive rate (TPR) and the true negative rate (TNR):

$$\mathrm{AM} = \frac{\mathrm{TPR} + \mathrm{TNR}}{2}, \qquad \mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad \mathrm{TNR} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}},$$

where TP, FN, TN, and FP are the numbers of true positives, false negatives, true negatives, and false positives. The QM is the corresponding quadratic mean measure [24]. Although AM and QM are popular in the imbalanced setting, surprisingly little work has focused on designing algorithms based on them. Only very recently did Menon provide a consistent algorithm that aims at directly optimizing the AM measure [25]. This approach is effective, but it is only suitable for the AM measure; whether it can be extended to other measures such as QM is still unknown.
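To illustrate why the misclassification error can mislead under imbalance while AM does not, the following sketch computes both on a toy label vector. The helper names and the 2-versus-8 toy split are assumptions chosen for illustration only.

```python
def rates(y_true, y_pred):
    """Return (TPR, TNR) for labels in {+1, -1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    return tp / n_pos, tn / n_neg


def am_score(y_true, y_pred):
    """Arithmetic mean of the true positive and true negative rates."""
    tpr, tnr = rates(y_true, y_pred)
    return 0.5 * (tpr + tnr)


def accuracy(y_true, y_pred):
    """Fraction of correctly predicted labels (1 - misclassification error)."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy data: 2 positives, 8 negatives, and a classifier that always says -1.
y_true = [1, 1] + [-1] * 8
y_pred = [-1] * 10
acc = accuracy(y_true, y_pred)
am = am_score(y_true, y_pred)
```

Here the always-negative classifier reaches an accuracy of 0.8 although it never detects a positive example, whereas its AM is only 0.5, the value of a trivial classifier.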
In contrast to Menon's work, in this paper we present a general learning framework whose loss function allows us to incorporate different imbalance measures. We exploit it for optimizing AM and QM. In the following, we discuss our approach in detail.

DOPMID: Direct Optimization of Performance Measure for Imbalanced Dataset
3.1. The Framework of DOPMID.
We refer to the classifier we present as DOPMID (Direct Optimization of Performance Measure for Imbalanced Dataset). The framework of DOPMID is based on the structural SVM proposed by Joachims et al. [26]. Specifically, we use the 1-slack SVM formulation, presented in (OP1) (optimization problem 1), to learn a linear parameter $w \in \mathbb{R}^d$. Note that the following approach can be extended to nonlinear functions and non-Euclidean instance spaces by using kernels [27]:

$$\text{(OP1)} \quad \min_{w,\ \xi \ge 0} \ \frac{1}{2}\|w\|^2 + C\xi \quad \text{s.t.} \quad \forall \hat{y} \in \mathcal{Y}: \ w^{T}\left[\Psi(x, y) - \Psi(x, \hat{y})\right] \ge \Delta(y, \hat{y}) - \xi. \qquad (5)$$

For simplicity, we assume that the training dataset $\{(x_i, y_i)\}_{i=1}^{n}$ has been ordered with the positive instances ahead of the negative ones, so that the true output $y$ consists of #pos positive labels followed by #neg negative labels, where #pos is the number of positive instances, #neg is the number of negative instances, and #pos + #neg = $n$. $\hat{y}$ stands for any possible permutation of the predicted list obtained using the parameter $w$. $\Psi$ represents a mapping function from the input list to the output list. $\Delta(y, \hat{y}) \to \mathbb{R}$ is a function used to measure the difference between the real output $y$ and the predicted output $\hat{y}$. This function must satisfy the following conditions:

$$\text{(1) } \Delta(y, \hat{y}) = 0 \ \text{ for all } \hat{y} = y; \qquad \text{(2) } \Delta(y, \hat{y}) > 0 \ \text{ for all } \hat{y} \ne y. \qquad (6)$$

In contrast to the traditional SVM, which has $n$ slack variables $\xi_i$, there is only a single slack variable $\xi$ in (OP1). We therefore refer to it as the "1-slack" SVM.

The Loss Functions Oriented to AM and QM.
For the framework above, we need to further define the functions $\Psi(x, y)$ and $\Delta(y, \hat{y})$ in order to determine the optimization target.
In this paper, we first define $\Psi(x, y)$ as in (7) and then define $\Delta(y, \hat{y})$ oriented to AM and QM as in (8) and (9), respectively. In (8) and (9), the function $1(\cdot)$ is an indicator function, which equals 1 if its argument is true and 0 otherwise. It is obvious that $\Delta_{AM}(y, \hat{y})$ and $\Delta_{QM}(y, \hat{y})$ defined in (8) and (9) satisfy the conditions in (6). It has been proved that if the function $\Delta$ satisfies (6), the slack $\xi$ is a convex upper bound on the training loss regularized by the norm of the weight vector [26].
In the following, we show that although $\xi_{AM}$ and $\xi_{QM}$ are both upper bounds, $\xi_{QM}$ is the lower of the two.
Lemma 1. The bound $\xi_{QM}$ obtained from (9) is lower than the bound $\xi_{AM}$ obtained from (8).
Proof. Since the slack $\xi$ is a convex upper bound on the training loss, we can rewrite (OP1) as a minimization over $w$ and $\xi \ge 0$. Replacing $\Delta(y, \hat{y})$ with (8) and (9) then gives the AM bound and the QM bound, respectively. Simplifying (14) and combining inequalities (17) and (19), we get $\xi_{QM} - \xi_{AM} \le 0$, which proves the claim.
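The ordering of the two losses can be checked numerically on a small example. The sketch below is only an illustration under a loudly stated assumption: it takes the AM loss to be the mean of the false negative rate and the false positive rate, and the QM loss to be the mean of their squares. These forms are assumed for illustration and are not taken from (8) and (9). Under this assumption the ordering follows from the fact that $r^2 \le r$ for any rate $r \in [0, 1]$.

```python
from itertools import product


def error_rates(y_true, y_pred):
    """Return (FNR, FPR) for labels in {+1, -1}."""
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    return fn / n_pos, fp / n_neg


def delta_am(y_true, y_pred):
    """Assumed AM-style loss: mean of the two class-wise error rates."""
    fnr, fpr = error_rates(y_true, y_pred)
    return 0.5 * (fnr + fpr)


def delta_qm(y_true, y_pred):
    """Assumed QM-style loss: mean of the squared class-wise error rates."""
    fnr, fpr = error_rates(y_true, y_pred)
    return 0.5 * (fnr ** 2 + fpr ** 2)


# Positives first, negatives afterwards, as in the ordering assumed in the text.
y = (1, 1, -1, -1, -1)
candidates = list(product((1, -1), repeat=len(y)))
gaps = [delta_am(y, y_hat) - delta_qm(y, y_hat) for y_hat in candidates]
```

Every entry of `gaps` is non-negative for this toy `y`, and both assumed losses vanish only when `y_hat` equals `y`, which matches the two conditions required of a loss in the framework.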
We could solve (OP1) by substituting (7), (8), and (9) into (5), but unfortunately a difficulty remains: since (5) contains one constraint for each $\hat{y} \in \mathcal{Y}$, it has an exponential number of constraints. To solve this problem, we propose to use the cutting plane algorithm, which is based on the fact that, for any $\epsilon > 0$, a small subset of the constraints is sufficient to find an $\epsilon$-approximate solution to the problem. The details of the cutting plane algorithm are shown in Algorithm 1.
The algorithm starts with no constraints and iteratively finds the output $\hat{y}$ associated with the most violated constraint. If the corresponding constraint is violated by more than $\epsilon$, we add $\hat{y}$ to the working set $W$ and re-solve (OP1) with the updated $W$. It can be shown that the outer loop of Algorithm 1 is guaranteed to halt within $O(1/\epsilon)$ iterations for any desired precision $\epsilon$ [26].
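The working-set loop described above can be sketched on a toy problem. This is a hedged illustration, not a reproduction of Algorithm 1: it assumes a scalar weight, an explicit finite list of constraints of the form w * g >= d - xi (with g standing in for the feature difference and d for the loss of one candidate output), and a hypothetical grid-search helper `solve_restricted_1d` in place of a real quadratic program solver.

```python
def solve_restricted_1d(working_set, C, lo=-5.0, hi=5.0, steps=10001):
    """Toy stand-in for the QP solver: grid search over a scalar w.

    For a fixed w, the smallest feasible slack is the largest violation
    d - w * g over the working set (or 0 if nothing is violated).
    """
    best = None
    for k in range(steps):
        w = lo + (hi - lo) * k / (steps - 1)
        xi = max(0.0, max(d - w * g for g, d in working_set))
        objective = 0.5 * w * w + C * xi
        if best is None or objective < best[0]:
            best = (objective, w, xi)
    return best[1], best[2]


def cutting_plane(constraints, C, epsilon):
    """1-slack cutting-plane loop over constraints w * g >= d - xi."""
    working_set = []
    w, xi = 0.0, 0.0
    for _ in range(len(constraints) + 1):
        # Inner problem: find the constraint with the largest violation.
        g, d = max(constraints, key=lambda c: c[1] - w * c[0])
        if d - w * g <= xi + epsilon:
            break  # no constraint is violated by more than epsilon
        working_set.append((g, d))
        w, xi = solve_restricted_1d(working_set, C)
    return w, xi, working_set


# Toy constraints (g, d); (0.0, 0.0) plays the role of the correct output.
constraints = [(1.0, 1.0), (2.0, 1.0), (0.5, 0.2), (0.0, 0.0)]
w, xi, working_set = cutting_plane(constraints, C=1.0, epsilon=0.001)
```

The loop stops as soon as the most violated constraint is already satisfied up to `epsilon`, so the returned `working_set` is typically much smaller than the full constraint list, which is the point of the method.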
Since the quadratic program in each iteration of this algorithm is of constant size, the only bottleneck in the algorithm is solving (OP2), which is known as the problem of "finding the most violated constraint." In the following, we show how it can be solved efficiently for the AM loss and the QM loss, respectively.

Efficient Algorithms for Finding the Most Violated Constraint.
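First of all, we rewrite (7) in terms of the quantities $\Gamma^{\#\mathrm{pos}}_{i}$ and $\Gamma^{\#\mathrm{neg}}_{i}$, which are computed according to the current weight $w$. For the AM loss, the resulting problem (OP3) decomposes over the instances; hence we apply the Hardy-Littlewood-Polya inequality and observe that (OP3) is maximized by sorting the terms $\Gamma^{\#\mathrm{pos}}_{i}$ and $\Gamma^{\#\mathrm{neg}}_{i}$ in decreasing order. Note that this permutation is easily obtained by applying Quick Sort in $O(n \log n)$ time.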
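The sorting argument rests on the Hardy-Littlewood-Polya (rearrangement) inequality: a sum of pairwise products is largest when both sequences are arranged in the same order. The sketch below checks this by brute force on a small example; the lists `gamma` and `weights` are illustrative values, not the actual terms of (OP3).

```python
from itertools import permutations


def pairing_value(a, b):
    """Sum of products a_i * b_i for one fixed pairing."""
    return sum(ai * bi for ai, bi in zip(a, b))


def best_pairing_by_sorting(a, b):
    """Pair the largest a with the largest b, and so on down the lists."""
    return pairing_value(sorted(a, reverse=True), sorted(b, reverse=True))


def best_pairing_by_brute_force(a, b):
    """Try every permutation of b against a fixed order of a."""
    return max(pairing_value(a, perm) for perm in permutations(b))


gamma = [0.3, -1.2, 2.5, 0.7]
weights = [4.0, 1.0, 3.0, 2.0]
by_sorting = best_pairing_by_sorting(gamma, weights)
by_brute_force = best_pairing_by_brute_force(gamma, weights)
```

Sorting costs $O(n \log n)$, whereas the brute-force search visits all $n!$ permutations, so the inequality is what makes the inner problem tractable.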

The Algorithm for QM Loss.
Unlike the AM loss, for which (OP2) can be decomposed linearly in the instances, the QM loss leads to a quite different (OP2) that needs a substantially extended algorithm, which we describe in the following. First, we substitute (9) and (21) into (OP2), which gives (OP4). The decomposition technique used in (OP3) is not suitable for (OP4), since the squared sums $(\sum_{i=1}^{\#\mathrm{pos}} \cdot)^2$ and $(\sum_{i=\#\mathrm{pos}+1}^{n} \cdot)^2$ can no longer be absorbed into the linear sums $\sum_{i=1}^{\#\mathrm{pos}}$ and $\sum_{i=\#\mathrm{pos}+1}^{n}$. To solve this problem, we provide a more elaborate optimization for (OP4). The algorithm we propose is based on the fact that each $\hat{y}_i$ in $\hat{y}$ has only two possible values, +1 and −1. We therefore present an efficient algorithm that finds the maximizer of (OP4) over $\hat{y} \in \mathcal{Y}$ in $O(n \log n)$ time (see Algorithm 2).
The idea behind Algorithm 2 is that the most violated constraint $\hat{y}$ must have the following form: $a$ ($0 \le a \le \#\mathrm{pos}$) positive instances are labeled positive and the other positive ones are labeled negative; $b$ ($0 \le b \le \#\mathrm{neg}$) negative instances are classified as negative and the other negative ones are classified as positive. So we can obtain $\hat{y}$ by testing each entry of a candidate $\tilde{y}$ with +1 and −1. Specifically, Algorithm 2 starts with a "perfect classification" (Step 2 to Step 5) and then uses a greedy procedure to find the most violated constraint $\hat{y}$ with the maximum value for (OP4) (Step 6 to Step 18).
Algorithm 2 is very efficient, and its running time can be split into two parts. The first part is the sort (Step 2 and Step 3), which requires $O(n \log n)$ time. The second part consists of the remaining steps, which require $O(\#\mathrm{pos} \cdot \#\mathrm{neg})$ time. Although in the worst case this is $O(n^2)$, the number of positive instances in imbalanced datasets is very small, which means that the running time of the second part is effectively $O(n)$. So Algorithm 2 has a complexity of $O(n \log n)$.
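The structural observation above (only the counts of correctly labeled positives and negatives matter to the loss) can be illustrated with an exhaustive search over those counts. This sketch is not Algorithm 2, which is greedy; it is a plain enumeration under two labeled assumptions: the objective is taken to be a loss plus the sum of predicted labels times scores, and the loss is an assumed QM-style mean of squared class-wise error rates. All function names and scores are hypothetical.

```python
from itertools import product


def delta_qm(fn, fp, n_pos, n_neg):
    """Assumed QM-style loss from false negative and false positive counts."""
    return 0.5 * ((fn / n_pos) ** 2 + (fp / n_neg) ** 2)


def most_violated_by_counts(pos_scores, neg_scores):
    """Maximize loss + sum(y_hat_i * s_i) by enumerating the counts (a, b)."""
    n_pos, n_neg = len(pos_scores), len(neg_scores)
    pos = sorted(pos_scores, reverse=True)  # keep the top-a positives as +1
    neg = sorted(neg_scores)                # keep the bottom-b negatives as -1
    pos_total, neg_total = sum(pos), sum(neg)
    best_value, best_counts = None, None
    pos_prefix = 0.0
    for a in range(n_pos + 1):
        if a > 0:
            pos_prefix += pos[a - 1]
        neg_prefix = 0.0
        for b in range(n_neg + 1):
            if b > 0:
                neg_prefix += neg[b - 1]
            # Labels +1 contribute +s, labels -1 contribute -s.
            linear = (2 * pos_prefix - pos_total) + (neg_total - 2 * neg_prefix)
            value = delta_qm(n_pos - a, n_neg - b, n_pos, n_neg) + linear
            if best_value is None or value > best_value:
                best_value, best_counts = value, (a, b)
    return best_value, best_counts


def most_violated_brute_force(pos_scores, neg_scores):
    """Reference answer: try every labeling in {+1, -1}^n."""
    scores = list(pos_scores) + list(neg_scores)
    n_pos, n_neg = len(pos_scores), len(neg_scores)
    best = None
    for y_hat in product((1, -1), repeat=len(scores)):
        fn = sum(1 for i in range(n_pos) if y_hat[i] == -1)
        fp = sum(1 for i in range(n_pos, len(scores)) if y_hat[i] == 1)
        value = delta_qm(fn, fp, n_pos, n_neg)
        value += sum(label * s for label, s in zip(y_hat, scores))
        best = value if best is None else max(best, value)
    return best


pos_scores = [0.4, -0.3, 1.1]
neg_scores = [-0.8, 0.2, 0.5, -0.1]
value_by_counts, counts = most_violated_by_counts(pos_scores, neg_scores)
value_by_brute_force = most_violated_brute_force(pos_scores, neg_scores)
```

After the two sorts, the double loop visits (#pos + 1)(#neg + 1) count pairs, which mirrors the $O(\#\mathrm{pos} \cdot \#\mathrm{neg})$ term in the running-time discussion and stays small when positives are rare.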

Datasets and Baselines.
The main goal of our experiments is to evaluate whether the proposed classifiers can outperform existing binary classifiers in the imbalanced setting. In particular, we select three datasets with varying degrees of class imbalance from the libsvm dataset repository (downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/), namely satimage, w1a, and vowel. The characteristics of those sets are summarized in Table 1.
In Table 1, "#Examples" and "#Features" denote the number of examples and features, respectively, and "Min(%)" represents the proportion of examples in the minority class.
On those imbalanced datasets, we compare our classifiers (DOPMID-AM, DOPMID-QM) against the following classifiers: SVM-Hinge [28], safe-level-SMOTE [7], CCPDT [11], SVM-F1 [22], and AM-Consist [25].Among them, SVM-Hinge is a traditional balanced binary classifier which seeks to minimize the hinge loss, and the latter four algorithms are all for the imbalanced setting.We choose those four as comparisons because we are interested in how our algorithms perform, when compared with other imbalanced algorithms.
In the following experiments, unless otherwise noted, the parameter $C$ of our algorithms is chosen from the set $\{2^{-10}, \ldots, 2^{10}\}$, and the error parameter $\epsilon$ of the algorithms is set to 0.001.
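For concreteness, the search grid can be written out as below. The sketch assumes that the elided members of the set are the integer powers of two between the stated endpoints; the source only gives the two endpoints, and the constant names are illustrative.

```python
# Candidate values for the trade-off parameter: 2^-10, 2^-9, ..., 2^10
# (integer exponents are an assumption; only the endpoints are stated).
C_GRID = [2.0 ** k for k in range(-10, 11)]

# Precision of the cutting plane stopping criterion, as stated in the text.
EPSILON = 0.001
```

Under that assumption the grid contains 21 candidate values spanning roughly six orders of magnitude.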

Experimental Results.
The experimental results are reported in terms of AM [23], QM [24], and F1 [15], which are all commonly adopted as the performance metrics for evaluating imbalanced learning classifiers.More specifically, we compare our proposed classifiers from the following two aspects.

Comparison with SVM-Hinge.
We design this comparison to see whether the proposed algorithms really behave better than the standard SVM on imbalanced datasets. The experimental results on the three sets are illustrated in Figure 1.
From Figure 1, we can see that, as expected, DOPMID-AM and DOPMID-QM are significantly better than SVM-Hinge on all three datasets in terms of all the measures. Due to space limitations, we only report the statistics measured by AM as an example. Compared with SVM-Hinge, DOPMID-AM improves by 32.21% on the satimage set, 14.42% on the w1a set, and 6.87% on the vowel set, while for DOPMID-QM the improvements are 33.80%, 14.42%, and 13.87%, respectively. These results support the effectiveness of our method and indicate once again that a more accurate classifier for imbalanced data can be obtained by directly optimizing the imbalance evaluation metrics. Meanwhile, it can be seen from the figures that DOPMID-QM outperforms DOPMID-AM on most of the points. More specifically, on the experimental sets, DOPMID-QM is better than DOPMID-AM on 6 of the 9 points and is similar to DOPMID-AM on the other three. These results thus suggest that DOPMID-QM can be more accurate than DOPMID-AM. This may be because the loss function of DOPMID-QM is lower than that of DOPMID-AM, which can yield a more precise classifier. This observation is also consistent with Chen's conjecture that a more accurate model can be created by defining a tighter bound [29].

Comparison with Other Imbalanced Algorithms.
In the second part, we are interested in how our algorithms perform when compared with other imbalanced binary classifiers. More specifically, we compare our DOPMID with safe-level-SMOTE, CCPDT, SVM-F1, and AM-Consist. We choose those four algorithms because, as discussed in the Introduction, they represent the two different kinds of methods for dealing with the imbalance problem. Safe-level-SMOTE uses an oversampling technique and belongs to the data-oriented methods. We adopt it instead of SMOTE, since it can produce better accuracy than SMOTE by using different weight degrees on the synthetic examples [7]. The latter three algorithms all belong to the algorithm-oriented methods. CCPDT is an efficient decision tree algorithm, which improves C4.5 on imbalanced datasets by using Fisher's exact test to prune branches of the tree. SVM-F1 and AM-Consist are both SVM based approaches. SVM-F1 modifies the traditional SVM by optimizing the F1 measure, while AM-Consist is a very recently proposed consistent classifier that aims at optimizing the AM measure, the same target as DOPMID-AM. We include it to make the evidence for the effectiveness of our algorithms more convincing. It should be noted that, in [25], the authors proposed two consistent AM algorithms, named Plugin and Balanced, respectively. In our experiments, we select Balanced as AM-Consist, because a detailed analysis in their supplementary material shows that Balanced performs better than Plugin. Figure 2 depicts the behavior of those algorithms on the satimage, w1a, and vowel datasets.
As can be seen from Figure 2, the performance of the compared algorithms varies from one dataset to another, and no single algorithm outperforms the others on all the datasets. For example, when measured by AM, safe-level-SMOTE performs best on the vowel set, while it is the worst on the other two sets. SVM-F1 performs well on satimage and w1a in terms of F1; however, it yields poor performance on w1a and vowel in terms of QM. Similarly, CCPDT achieves better performance than AM-Consist on the satimage dataset, but it is worse than AM-Consist on the w1a set. In contrast to those comparison algorithms, the proposed DOPMID-QM appears to perform more stably across all the datasets. Statistics show that, when compared with the other imbalanced classifiers, DOPMID-QM performs best on 4 of 9 points and is the second best on 4 of 9 points. It is the third best on w1a when measured by F1. All of this demonstrates that, even compared with other imbalanced binary classifiers, our DOPMID-QM approach is still effective and is suitable for imbalanced settings.
Finally, we compare our DOPMID-AM with AM-Consist, because they both improve the traditional SVM by optimizing the AM measure. The results in Figure 2 are somewhat surprising: DOPMID-AM wins on 5 of 9 points and loses on 4 points. More specifically, when compared with DOPMID-AM, AM-Consist performs better on the w1a set and yields poorer performance on the other two sets. One possible explanation is that AM-Consist is a consistent algorithm and may be more suitable for sets with a large number of examples (such as the w1a dataset).

Conclusion
AM and QM are popularly used performance measures in the imbalanced setting. In this paper, we have proposed a structural SVM based method, termed DOPMID, for optimizing them. Specifically, we designed objective functions oriented to AM and QM, respectively, and showed that the QM function is a tighter bound than the AM one. Because the objective functions have an exponential number of constraints, we introduced the cutting plane algorithm for the outer optimization, which only needs $O(1/\epsilon)$ iterations. Then, for the inner computational challenge of the AM loss, we decomposed the problem and applied the Hardy-Littlewood-Polya inequality to solve it, while for the QM loss we proposed an efficient greedy algorithm, which still only requires $O(n \log n)$ time. Our experiments on the imbalanced datasets showed that DOPMID is superior to the existing baseline techniques in terms of performance and stability. In future work, we hope to extend our approach to multiclass classification under class imbalance.

Figure 2: The performance comparison between DOPMID and other imbalanced binary algorithms.

Table 1: Characteristics of the imbalanced datasets.