A Learning Framework of Nonparallel Hyperplanes Classifier

A novel learning framework of nonparallel hyperplanes support vector machines (NPSVMs) is proposed for binary and multiclass classification. This framework not only includes twin SVM (TWSVM) and many of its variants but also extends them to multiclass classification when different parameters or loss functions are chosen. Concretely, we discuss the linear and nonlinear cases of the framework, taking the hinge loss function as an example. Moreover, we give the primal problems of the extensions of several TWSVM variants. It is worth mentioning that, in the decision function, the Euclidean distance is replaced by the absolute value $|w^T x + b|$, which keeps the decision function consistent with the optimization problem and reduces the computational cost, particularly when a kernel function is introduced. Numerical experiments on several artificial and benchmark datasets indicate that our framework is not only fast but also generalizes well.


Introduction
Classification is an important problem in machine learning and data mining and mainly comprises binary and multiclass classification. The support vector machine (SVM), proposed by Burges [1] and Cortes and Vapnik [2], is an excellent tool for classification. In contrast with conventional artificial neural networks (ANNs), which aim at reducing the empirical risk, SVM is principled and implements structural risk minimization (SRM), which minimizes an upper bound of the generalization error [3][4][5]. Within a few years of its introduction, SVM was successfully applied to pattern classification and regression estimation tasks such as face detection [6,7], text categorization [8], time series prediction [9], and bioinformatics [10].
Recently, for binary classification, Mangasarian and Wild [11] proposed the generalized eigenvalue proximal support vector machine (GEPSVM), which uses two nonparallel hyperplanes. In their approach, the data points of each class are proximal to one of the two nonparallel hyperplanes, which are determined by the eigenvectors corresponding to the smallest eigenvalues of two related generalized eigenvalue problems. Inspired by GEPSVM [11], Jayadeva et al. [12] developed the twin SVM (TWSVM), which also uses two nonparallel hyperplanes. In TWSVM, however, the two hyperplanes are obtained by solving two quadratic programming (QP) problems, as in the standard SVM. TWSVM nevertheless differs from the standard SVM in a fundamental way: one solves a pair of smaller QP problems rather than a single large QP problem. Therefore, TWSVM works faster than the standard SVM. Subsequently, many extensions of TWSVM have been proposed, including the improvements on TWSVM (TBSVM) [13], the least squares TWSVM (LS-TWSVM) [14][15][16][17], the nonparallel plane proximal classifier (NPPC) [18], smooth TWSVM [19], a geometric algorithm [20], and twin support vector regression (TWSVR) [21]. TWSVM has also been extended to multiclass classification [22][23][24]. More precisely, in [22], TWSVM was extended directly from binary to multiclass classification: for the $k$th ($k = 1, 2, \ldots, K$) hyperplane, the constraints of the primal problem cover all patterns except those of the $k$th class. In [23], the authors extended TWSVM to multiclass classification based on the "one-versus-rest" (1-v-r) idea, in which two QP problems are solved for each reconstructed binary classification problem. However, neither approach retains the advantage of TWSVM, namely, lower computational complexity than the standard SVM. In [24], Yang et al. proposed the multiple birth SVM (MBSVM), which has much lower computational complexity than the methods in [22,23] because it solves smaller QP problems for $K$-class classification; like TWSVM, however, it considers only the empirical risk. In contrast, TBSVM [13] implements the structural risk minimization principle by introducing a regularization term.
In this paper, we propose a novel learning framework of nonparallel hyperplanes support vector machines based on TWSVM and its extensions, called NPSVMs, which not only provides a unified view of TWSVM and many of its extensions but can also handle both binary and multiclass classification problems. For binary classification, if the loss function is the hinge loss, the framework becomes TWSVM [12] or TBSVM [13], depending on the parameters; if it is the square loss, the framework is LS-TWSVM [14]; and if it is a convex combination of the linear and square losses, the framework is NPPC [18]. We can also obtain smooth TWSVM [19] by replacing the 2-norm with the 1-norm in the framework. For multiclass classification, however, the framework is not a direct extension of TWSVM: we switch the roles of the patterns of the $k$th class and the remaining classes and replace "min" with "max" in the decision function. Moreover, we use only the absolute value $|w^T x + b|$ rather than the Euclidean distance in the decision function, for two reasons: it reduces the computational cost, particularly when a kernel function is introduced, and it is consistent with the primal problems, in which the same absolute value appears. Concretely, we discuss the linear and nonlinear cases of the framework, taking the hinge loss function as an example. Moreover, we give the primal problems of the extensions of LS-TWSVM, 1-norm LS-TWSVM, NPPC, and smooth TWSVM. Finally, numerical experiments on several artificial and benchmark datasets indicate that our frameworks are not only fast but also generalize well.
The paper is organized as follows. Section 2 briefly reviews SVMs. Section 3 proposes our frameworks: Section 3.1 discusses the linear framework, Section 3.2 extends it to the nonlinear framework, Section 3.3 gives the SOR algorithm for solving the hinge NPSVMs, and Section 3.4 discusses several other extensions. Finally, Section 4 presents experimental results and Section 5 contains concluding remarks.

Brief Reviews of SVMs
2.1. Twin Support Vector Machine. Given the following training set for binary classification:

$$T = \{(x_1, y_1), \ldots, (x_l, y_l)\}, \tag{1}$$

where $(x_i, y_i)$ is the $i$th data point, the input $x_i \in R^n$ is a pattern, the output $y_i \in \{1, 2\}$ is a class label, $i = 1, \ldots, l$, and $l$ is the number of data points. In addition, let $l_1$ and $l_2$ be the numbers of data points in the positive class and the negative class, respectively, so that $l = l_1 + l_2$. Furthermore, the matrices $A_1 \in R^{l_1 \times n}$ and $A_2 \in R^{l_2 \times n}$ consist of the $l_1$ inputs of Class 1 and the $l_2$ inputs of Class 2, respectively. The goal of TWSVM [12] is to find two nonparallel hyperplanes in the $n$-dimensional input space,

$$w_1^T x + b_1 = 0, \tag{2}$$

$$w_2^T x + b_2 = 0, \tag{3}$$

such that each hyperplane is close to the patterns of one class and, to some extent, far away from the patterns of the other class. TWSVM is in the spirit of GEPSVM [11], but both GEPSVM and TWSVM differ from the standard SVM. For TWSVM, each hyperplane is generated by solving a QP problem that resembles the primal problem of the standard SVM. The primal problems of TWSVM can be presented as follows:

$$\min_{w_1, b_1, \xi_2} \ \frac{1}{2}\|A_1 w_1 + e_1 b_1\|^2 + c_1 e_2^T \xi_2 \quad \text{s.t.} \ -(A_2 w_1 + e_2 b_1) + \xi_2 \ge e_2, \ \xi_2 \ge 0, \tag{4}$$

$$\min_{w_2, b_2, \xi_1} \ \frac{1}{2}\|A_2 w_2 + e_2 b_2\|^2 + c_2 e_1^T \xi_1 \quad \text{s.t.} \ (A_1 w_2 + e_1 b_2) + \xi_1 \ge e_1, \ \xi_1 \ge 0, \tag{5}$$

where $c_1$ and $c_2$ are nonnegative parameters, $e_1$ and $e_2$ are vectors of ones of appropriate dimensions, and $\xi_1$ and $\xi_2$ are slack vectors. In the QP problem (4), the objective function tends to keep hyperplane (2) close to the patterns of Class 1, and the constraints require hyperplane (2) to be at a distance of at least 1 from the patterns of Class 2. The QP problem (5) has a similar property. Moreover, we note that the constraints of each problem do not contain all patterns in the training set (1) but only the patterns of one of the two classes. Therefore, in [12], the authors claimed that TWSVM is approximately four times faster than the standard SVM.

Define $G = [A_2 \ e_2]$ and $H = [A_1 \ e_1]$. It has been shown that when both $H^T H$ and $G^T G$ are positive definite, the Wolfe duals of (4) and (5) are written as follows:

$$\max_{\alpha} \ e_2^T \alpha - \frac{1}{2}\alpha^T G (H^T H)^{-1} G^T \alpha \quad \text{s.t.} \ 0 \le \alpha \le c_1 e_2, \tag{6}$$

$$\max_{\gamma} \ e_1^T \gamma - \frac{1}{2}\gamma^T H (G^T G)^{-1} H^T \gamma \quad \text{s.t.} \ 0 \le \gamma \le c_2 e_1, \tag{7}$$

respectively, where $\alpha \in R^{l_2}$ and $\gamma \in R^{l_1}$ are Lagrangian multipliers. In order to avoid the possible ill-conditioning of $H^T H$ and $G^T G$, TWSVM introduces a term $\epsilon I$, where $\epsilon > 0$ is a fixed small scalar and $I$ is an identity matrix of appropriate dimensions.
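Thus, the nonparallel hyperplanes (2) and (3) can be obtained from the solutions $\alpha$ and $\gamma$ of the QP problems (6) and (7):

$$v_1 = -(H^T H + \epsilon I)^{-1} G^T \alpha, \qquad v_2 = (G^T G + \epsilon I)^{-1} H^T \gamma, \tag{8}$$

where $v_k = [w_k^T \ b_k]^T$, $k = 1, 2$. Moreover, a new pattern $x \in R^n$ is assigned to Class $k$ ($k = 1, 2$), depending on which of the two nonparallel hyperplanes given by (2) and (3) it lies closer to; that is,

$$\text{Class } k = \arg\min_{k = 1, 2} \frac{|w_k^T x + b_k|}{\|w_k\|}. \tag{9}$$

2.2. Multiple Birth Support Vector Machine. Given the training set

$$T = \{(x_1, y_1), \ldots, (x_l, y_l)\}, \tag{10}$$

where the input $x_i \in R^n$, $i = 1, \ldots, l$, is the pattern and the output $y_i \in \{1, \ldots, K\}$ is the class label. The task is to seek $K$ hyperplanes,

$$w_k^T x + b_k = 0, \quad k = 1, \ldots, K, \tag{11}$$

and to assign the class label according to which hyperplane a new pattern is farthest from. For convenience, denote the number of data points of the $k$th class in the training set (10) as $l_k$ and define the following matrices: the patterns belonging to the $k$th class are represented by the matrix $A_k \in R^{l_k \times n}$, $k = 1, \ldots, K$. In addition, define the matrix

$$B_k = [A_1^T, \ldots, A_{k-1}^T, A_{k+1}^T, \ldots, A_K^T]^T; \tag{12}$$

that is, $B_k \in R^{(l - l_k) \times n}$ consists of the patterns belonging to all classes except the $k$th class, $k = 1, \ldots, K$. The primal problems of MBSVM [24] comprise the following QP problems:

$$\min_{w_k, b_k, \xi_k} \ \frac{1}{2}\|B_k w_k + e_{k1} b_k\|^2 + c_k e_{k2}^T \xi_k \quad \text{s.t.} \ -(A_k w_k + e_{k2} b_k) + \xi_k \ge e_{k2}, \ \xi_k \ge 0, \tag{13}$$

where $e_{k1} \in R^{(l - l_k)}$ and $e_{k2} \in R^{l_k}$ are vectors of ones, $\xi_k$ is the slack variable, and $c_k > 0$ is the penalty parameter, $k = 1, \ldots, K$. The dual problem of the QP problem (13) is formulated as follows:

$$\max_{\alpha_k} \ e_{k2}^T \alpha_k - \frac{1}{2}\alpha_k^T G_k (H_k^T H_k)^{-1} G_k^T \alpha_k \quad \text{s.t.} \ 0 \le \alpha_k \le c_k e_{k2}, \tag{14}$$

where the penalty parameter $c_k > 0$, $H_k = [B_k \ e_{k1}]$, and $G_k = [A_k \ e_{k2}]$, $k = 1, 2, \ldots, K$. Similarly, in order to avoid possible ill-conditioning of the matrix $H_k^T H_k$, one introduces a regularization term $\epsilon I$, where $\epsilon > 0$ is a fixed small scalar and $I$ is the identity matrix of appropriate size.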
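After obtaining the solution $[w_k^T \ b_k]^T = -(H_k^T H_k + \epsilon I)^{-1} G_k^T \alpha_k$ of the QP problem (13) for $k = 1, \ldots, K$, a new pattern $x \in R^n$ is assigned to class $k$ ($k \in \{1, \ldots, K\}$), depending on which of the $K$ hyperplanes given by (11) it lies farthest from; that is, the decision function is represented as

$$f(x) = \arg\max_{k = 1, \ldots, K} \frac{|w_k^T x + b_k|}{\|w_k\|}, \tag{15}$$

where $|\cdot|$ is the absolute value.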
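The two decision rules reviewed in this section differ only in the direction of the comparison: TWSVM picks the nearest hyperplane, whereas MBSVM picks the farthest one. The following pure-Python sketch contrasts them on a toy example. The function names and the two example hyperplanes are illustrative assumptions, not taken from [12] or [24], and both rules are written here with the normalized distance $|w_k^T x + b_k| / \|w_k\|$.

```python
import math

def _distance(x, w, b):
    """Euclidean distance |w^T x + b| / ||w|| from pattern x to a hyperplane."""
    value = sum(wi * xi for wi, xi in zip(w, x)) + b
    return abs(value) / math.sqrt(sum(wi * wi for wi in w))

def twsvm_predict(x, planes):
    """TWSVM-style rule: class of the NEAREST hyperplane (1-based label)."""
    return min(range(len(planes)), key=lambda k: _distance(x, *planes[k])) + 1

def mbsvm_predict(x, planes):
    """MBSVM-style rule: class of the FARTHEST hyperplane (1-based label)."""
    return max(range(len(planes)), key=lambda k: _distance(x, *planes[k])) + 1

# Hypothetical hyperplanes (w_k, b_k): x1 = 0 and x2 = 0.
planes = [([1.0, 0.0], 0.0), ([0.0, 1.0], 0.0)]
x = [0.1, 2.0]  # close to the first hyperplane, far from the second

print(twsvm_predict(x, planes))  # nearest hyperplane is the first one
print(mbsvm_predict(x, planes))  # farthest hyperplane is the second one
```

The same pattern receives different labels under the two rules because each rule presupposes a different training objective: in TWSVM the $k$th hyperplane is fitted to class $k$, while in MBSVM it is fitted to all classes except class $k$.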

The Framework of Nonparallel Hyperplanes Classifiers
In this section, we propose a learning framework of nonparallel hyperplanes classifiers, which gives a unified form for TWSVM and many of its extensions and extends them to multiclass classification. We first develop the linear framework and then extend it to the nonlinear framework.

Linear Framework.
Given the training set (10), the task is to find $K$ nonparallel hyperplanes, one for each class:

$$w_k^T x + b_k = 0, \quad k = 1, \ldots, K. \tag{16}$$

To obtain the unknown hyperplanes, we construct the following standard framework for each of them:

$$\min_{w_k, b_k} \ \frac{1}{2}\|B_k w_k + e_{k1} b_k\|^2 + \frac{c^*}{2}\left(\|w_k\|^2 + b_k^2\right) + c_k e_{k2}^T L(e_{k2}, A_k w_k + e_{k2} b_k), \tag{17}$$

where the matrix $A_k$ comprises the patterns of the $k$th class, the matrix $B_k$ is defined by (12), $c^* \ge 0$ and $c_k > 0$ are parameters, $e_{k1}$ and $e_{k2}$ are vectors of ones of appropriate dimensions, $k = 1, 2, \ldots, K$, and $L(\cdot, \cdot)$ is the loss function (e.g., square loss or hinge loss), applied componentwise. In the optimization problem (17), the first term approximately minimizes the sum of the squared Euclidean distances from the patterns of all classes except the $k$th class to the $k$th hyperplane; the second term is the Tikhonov regularization term [25], which implements the structural risk minimization principle as in TBSVM [13]; the third term is the empirical loss, whose form depends on the chosen model.
For a new pattern $x \in R^n$, we assign it to class $k$ ($k = 1, 2, \ldots, K$) according to the following decision function:

$$f(x) = \arg\max_{k = 1, \ldots, K} |w_k^T x + b_k|, \tag{18}$$

where $|\cdot|$ is the absolute value. Note that we use only the absolute value $|w_k^T x + b_k|$ in the decision function, for two main reasons. First, the first term of the optimization problem (17) minimizes the sum of the squares of $w_k^T x + b_k$ rather than the sum of the squared Euclidean distances from the patterns to the hyperplanes, so using the absolute value keeps the decision function consistent with the optimization problem. Second, it reduces the computational cost, particularly when the kernel function is introduced later. In fact, if $K = 2$, the parameter $c^*$ is equal to 0, and the loss function is the hinge loss, that is, $L(1, f(x)) = \max(0, 1 - f(x))$, then the optimization problem (17) becomes TWSVM [12]. Moreover, if the parameter $c^* > 0$ is adjustable, it becomes TBSVM [13]. If the loss function is the square loss, that is, $L(1, f(x)) = (1 - f(x))^2$, it is LS-TWSVM [14]. If the loss function is a convex combination of the linear and square losses, that is,

$$L(1, f(x)) = \lambda (1 - f(x)) + (1 - \lambda)(1 - f(x))^2,$$

where $\lambda \in (0, 1)$, then it is NPPC [18]. Other extensions of TWSVM, for instance, smooth TWSVM and 1-norm LS-TWSVM [17], are also contained in the optimization problem (17); we only need to select the proper norm or loss function. More importantly, our framework can solve multiclass classification problems and thus extends TWSVM, TBSVM, LS-TWSVM, NPPC, and so forth. It should be pointed out that our framework is not a direct extension of TWSVM and its variants. Concretely, in the optimization problem (17), the first term contains the patterns of all classes except the $k$th class, and the third term involves only the patterns of the $k$th class. This strategy does not lead to a significant increase in the complexity of the optimization as the number of classes grows. We elaborate on this point for the specific algorithm below.
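To make the role of the loss function concrete, the following pure-Python sketch evaluates a framework objective of the form described above for the three losses mentioned (hinge, square, and a linear/square mix). It assumes the objective is the sum of a fitting term over the other-class patterns, a Tikhonov term $\frac{c^*}{2}(\|w\|^2 + b^2)$, and $c_k$ times the summed loss over the own-class patterns. The function names, the mixing weight `lam`, and the toy data are illustrative assumptions.

```python
def hinge(f):
    """Hinge loss L(1, f) = max(0, 1 - f)."""
    return max(0.0, 1.0 - f)

def square(f):
    """Square loss L(1, f) = (1 - f)^2."""
    return (1.0 - f) ** 2

def linear_square_mix(f, lam=0.5):
    """Convex combination of linear and square loss (assumed form, weight lam)."""
    return lam * (1.0 - f) + (1.0 - lam) * (1.0 - f) ** 2

def objective(w, b, A_k, B_k, c_star, c_k, loss):
    """Fitting term on B_k + Tikhonov term + c_k * summed loss on A_k."""
    f = lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b
    fit = 0.5 * sum(f(x) ** 2 for x in B_k)
    reg = 0.5 * c_star * (sum(wi * wi for wi in w) + b * b)
    emp = c_k * sum(loss(f(x)) for x in A_k)
    return fit + reg + emp

# Toy data: class-k patterns A_k and remaining patterns B_k (hypothetical).
A_k = [[2.0, 0.0], [0.5, 1.0]]
B_k = [[0.0, 1.0], [0.1, -1.0]]
w, b = [1.0, 0.0], 0.0

for loss in (hinge, square, linear_square_mix):
    print(loss.__name__, objective(w, b, A_k, B_k, c_star=0.1, c_k=1.0, loss=loss))
```

Only the `loss` argument changes between the models; the fitting and regularization terms are shared, which is the sense in which the framework unifies them.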
Now, taking the hinge loss function as an example, we give the detailed algorithm, called hinge NPSVM (HNPSVM). With the hinge loss function, the optimization problem (17) takes the following form:

$$\min_{w_k, b_k} \ \frac{1}{2}\|B_k w_k + e_{k1} b_k\|^2 + \frac{c^*}{2}\left(\|w_k\|^2 + b_k^2\right) + c_k e_{k2}^T \max\left(0, e_{k2} - (A_k w_k + e_{k2} b_k)\right), \tag{19}$$

where the matrix $A_k$ comprises the patterns of the $k$th class, the matrix $B_k$ is defined by (12), $c^* \ge 0$ and $c_k > 0$ are parameters, $e_{k1}$ and $e_{k2}$ are vectors of ones of appropriate dimensions, $k = 1, 2, \ldots, K$, and the maximum is taken componentwise. This problem is equivalent to the following quadratic programming problem:

$$\min_{w_k, b_k, \xi_k} \ \frac{1}{2}\|B_k w_k + e_{k1} b_k\|^2 + \frac{c^*}{2}\left(\|w_k\|^2 + b_k^2\right) + c_k e_{k2}^T \xi_k \quad \text{s.t.} \ (A_k w_k + e_{k2} b_k) + \xi_k \ge e_{k2}, \ \xi_k \ge 0, \tag{20}$$

where $\xi_k$ is the slack variable and the remaining symbols are as in (19).
In fact, for $k = 1, 2, \ldots, K$, we have $K$ QP problems of the form (20). In particular, when $K$ is equal to 2, that is, $k = 1, 2$, the QP problems (4) and (5) are obtained as a special case of (20) with $c^* = 0$. For simplicity, assume that the classes are approximately balanced; namely, the number of patterns in the $k$th class is $l_k = l/K$. Since the constraints of (20) involve only the patterns of the $k$th class, the complexity of problem (20) is no more than $(l/K)^3$. If TWSVM is directly extended to the multiclass case as in [22], we obtain a different optimization problem, in which the roles of the patterns of the $k$th class and the remaining classes are switched. The complexity of that problem is significantly higher: it is determined by the patterns of all classes except the $k$th class in the training set (10) and is no more than $((K - 1)(l/K))^3$. Hence, our approach is approximately $(K - 1)^3$ times faster than the model in [22]. On the other hand, when the classes are unbalanced, our approach is still faster than the model in [22], because the complexity of our optimization problem is determined by the number of patterns of the $k$th class rather than by the patterns of the remaining classes. Therefore, our HNPSVM keeps the computational complexity low.
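As a quick numerical check of the complexity argument, the sketch below compares the two cubic cost bounds for a hypothetical balanced problem. The cubic cost model follows the text; the values of `l` and `K` are arbitrary illustrative choices.

```python
def qp_cost(num_constraints):
    """Cubic cost model for a QP with the given number of constraints."""
    return num_constraints ** 3

l, K = 600, 4                    # hypothetical: 600 patterns in 4 balanced classes
own_class = l // K               # constraints when only class k is constrained
rest = (K - 1) * (l // K)        # constraints in a direct extension as in [22]

ratio = qp_cost(rest) / qp_cost(own_class)
print(ratio)                     # equals (K - 1) ** 3 = 27.0
```

The ratio depends only on $K$ in the balanced case, so the advantage grows quickly with the number of classes.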
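The solution of the primal problem (20) is usually obtained from the solution of its dual problem, so we now derive the dual. The Lagrangian function of problem (20) is given by

$$\mathcal{L}(w_k, b_k, \xi_k, \alpha_k, \beta_k) = \frac{1}{2}\|B_k w_k + e_{k1} b_k\|^2 + \frac{c^*}{2}\left(\|w_k\|^2 + b_k^2\right) + c_k e_{k2}^T \xi_k - \alpha_k^T\left((A_k w_k + e_{k2} b_k) + \xi_k - e_{k2}\right) - \beta_k^T \xi_k, \tag{21}$$

where $\alpha_k$ and $\beta_k$ are nonnegative Lagrange multiplier vectors. The Karush-Kuhn-Tucker (KKT) necessary and sufficient optimality conditions [26] for the QP problem (20) are given by

$$B_k^T(B_k w_k + e_{k1} b_k) + c^* w_k - A_k^T \alpha_k = 0, \tag{22}$$

$$e_{k1}^T(B_k w_k + e_{k1} b_k) + c^* b_k - e_{k2}^T \alpha_k = 0, \tag{23}$$

$$c_k e_{k2} - \alpha_k - \beta_k = 0, \tag{24}$$

$$(A_k w_k + e_{k2} b_k) + \xi_k \ge e_{k2}, \quad \xi_k \ge 0, \tag{25}$$

$$\alpha_k^T\left((A_k w_k + e_{k2} b_k) + \xi_k - e_{k2}\right) = 0, \quad \beta_k^T \xi_k = 0, \tag{26}$$

$$\alpha_k \ge 0, \quad \beta_k \ge 0. \tag{27}$$

Since $\beta_k \ge 0$, according to (24), we have

$$0 \le \alpha_k \le c_k e_{k2}. \tag{28}$$

Next, from (22) and (23), we can obtain

$$\left([B_k \ e_{k1}]^T [B_k \ e_{k1}] + c^* I\right)[w_k^T \ b_k]^T - [A_k \ e_{k2}]^T \alpha_k = 0, \tag{29}$$

where $I$ is an identity matrix of appropriate dimensions. Let $H_k = [B_k \ e_{k1}]$, $G_k = [A_k \ e_{k2}]$, and $v_k = [w_k^T \ b_k]^T$; then (29) can be written as

$$v_k = (H_k^T H_k + c^* I)^{-1} G_k^T \alpha_k. \tag{30}$$

Substituting (30) into the Lagrangian function (21) and using (22)-(28), we get the dual problem of the primal problem (20):

$$\max_{\alpha_k} \ e_{k2}^T \alpha_k - \frac{1}{2}\alpha_k^T G_k (H_k^T H_k + c^* I)^{-1} G_k^T \alpha_k \quad \text{s.t.} \ 0 \le \alpha_k \le c_k e_{k2}, \tag{31}$$

where $c^* > 0$ and $c_k > 0$ are parameters and $k = 1, 2, \ldots, K$. Once the solution of the QP problem (31) is available, we obtain the nonparallel hyperplanes (16) by (30).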
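The step from a dual solution back to a hyperplane can be sketched as follows. The code assumes the sign convention in which the hinge constraint reads $A_k w_k + e b_k + \xi_k \ge e$, so that $v_k = (H_k^T H_k + c^* I)^{-1} G_k^T \alpha_k$ with $H_k = [B_k \ e]$ and $G_k = [A_k \ e]$. It is a minimal pure-Python illustration: the helper names, the Gaussian-elimination solver, and the toy multiplier vector `alpha` (which is not an actual dual optimum) are all assumptions made for the example.

```python
def solve(M, rhs):
    """Solve M z = rhs by Gaussian elimination with partial pivoting."""
    n = len(M)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[r][j] -= factor * aug[col][j]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z

def recover_hyperplane(A_k, B_k, alpha, c_star):
    """Return (w, b) from multipliers alpha via v = (H^T H + c* I)^{-1} G^T alpha."""
    H = [list(x) + [1.0] for x in B_k]      # H_k = [B_k  e]
    G = [list(x) + [1.0] for x in A_k]      # G_k = [A_k  e]
    d = len(H[0])
    M = [[sum(row[i] * row[j] for row in H) for j in range(d)] for i in range(d)]
    for i in range(d):
        M[i][i] += c_star                   # Tikhonov term keeps M well conditioned
    rhs = [sum(G[i][j] * alpha[i] for i in range(len(G))) for j in range(d)]
    v = solve(M, rhs)
    return v[:-1], v[-1]

# Toy inputs (hypothetical): two class-k patterns, three remaining patterns.
A_k = [[2.0, 0.0], [3.0, 1.0]]
B_k = [[0.0, 1.0], [0.0, -1.0], [1.0, 0.0]]
alpha = [0.5, 0.2]
w, b = recover_hyperplane(A_k, B_k, alpha, c_star=0.1)
print(w, b)
```

With $c^* > 0$ the matrix $H_k^T H_k + c^* I$ is always invertible, which is why the regularized formulation does not need the extra $\epsilon I$ safeguard.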
It is worth mentioning that the parameter $c^*$ replaces $\epsilon$ in (8), so $c^*$ is no longer a fixed small scalar but a weighting factor that determines the trade-off between the regularization term and the empirical risk in problem (20). Therefore, the value of $c^*$ reflects the structural risk minimization principle, and our HNPSVM includes MBSVM as a special case.

Nonlinear Framework.
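Similarly, we extend the linear framework of NPSVMs to the nonlinear case. For a $K$-class classification problem with training set (10), our goal is to find $K$ kernel-generated hyperplanes:

$$\mathcal{K}(x, C^T) u_k + b_k = 0, \quad k = 1, \ldots, K, \tag{32}$$

where $C = [x_1, \ldots, x_l]$ and $\mathcal{K}(\cdot, \cdot)$ is an appropriately chosen kernel function.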
In order to obtain the hyperplanes (32), we construct the following framework:

$$\min_{u_k, b_k} \ \frac{1}{2}\|\mathcal{K}(B_k, C^T) u_k + e_{k1} b_k\|^2 + \frac{c^*}{2}\left(\|u_k\|^2 + b_k^2\right) + c_k e_{k2}^T L(e_{k2}, \mathcal{K}(A_k, C^T) u_k + e_{k2} b_k), \tag{33}$$

where the matrix $A_k$ comprises the patterns of the $k$th class, the matrix $B_k$ is defined by (12), $c^* \ge 0$ and $c_k > 0$ are parameters, $e_{k1}$ and $e_{k2}$ are vectors of ones of appropriate dimensions, $k = 1, 2, \ldots, K$, and $L(\cdot, \cdot)$ is the loss function (e.g., square loss or hinge loss). As discussed in the last subsection, for $K = 2$ the problem (33) reduces to the nonlinear formulations of different approaches (e.g., TWSVM, TBSVM, LS-TWSVM, and NPPC) when different loss functions or parameters are selected.
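A new pattern $x \in R^n$ is assigned to the $k$th class by the following decision function:

$$f(x) = \arg\max_{k = 1, \ldots, K} |\mathcal{K}(x, C^T) u_k + b_k|, \tag{34}$$

where $|\cdot|$ is the absolute value. Note that, in the decision function (34), we compute only the absolute value rather than the Euclidean distance from the pattern to the hyperplanes. This strategy reduces the computational complexity, because the Euclidean distance from the pattern to the $k$th hyperplane would be $|\mathcal{K}(x, C^T) u_k + b_k| / \sqrt{u_k^T \mathcal{K}(C^T, C^T) u_k}$. Thus, the decision function (34) not only saves computation but also remains consistent with the first term of problem (33). We again select the hinge loss function as an example. Then, problem (33) can be formulated as follows:

$$\min_{u_k, b_k, \xi_k} \ \frac{1}{2}\|\mathcal{K}(B_k, C^T) u_k + e_{k1} b_k\|^2 + \frac{c^*}{2}\left(\|u_k\|^2 + b_k^2\right) + c_k e_{k2}^T \xi_k \quad \text{s.t.} \ (\mathcal{K}(A_k, C^T) u_k + e_{k2} b_k) + \xi_k \ge e_{k2}, \ \xi_k \ge 0, \tag{35}$$

where the matrix $A_k$ comprises the patterns of the $k$th class, the matrix $B_k$ is defined by (12), $c^* > 0$ and $c_k > 0$ are parameters, $e_{k1}$ and $e_{k2}$ are vectors of ones of appropriate dimensions, and $k = 1, 2, \ldots, K$. By a derivation similar to that of the linear case, its dual problem is formulated as

$$\max_{\alpha_k} \ e_{k2}^T \alpha_k - \frac{1}{2}\alpha_k^T G_k (H_k^T H_k + c^* I)^{-1} G_k^T \alpha_k \quad \text{s.t.} \ 0 \le \alpha_k \le c_k e_{k2}, \tag{36}$$

where now $H_k = [\mathcal{K}(B_k, C^T) \ e_{k1}]$ and $G_k = [\mathcal{K}(A_k, C^T) \ e_{k2}]$.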
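The kernel decision rule can be sketched in a few lines. The example below assumes a Gaussian kernel $\exp(-\mu \|x - z\|^2)$ and a decision of the form $\arg\max_k |\sum_i \mathcal{K}(x, x_i) u_{k,i} + b_k|$, that is, the unnormalized absolute value. The function names, the parameter name `mu`, and the toy coefficient vectors are illustrative assumptions rather than trained models.

```python
import math

def gaussian_kernel(x, z, mu):
    """Gaussian kernel exp(-mu * ||x - z||^2)."""
    return math.exp(-mu * sum((a - c) ** 2 for a, c in zip(x, z)))

def kernel_predict(x, C, models, mu):
    """Assign x to the class with the largest |K(x, C) u_k + b_k| (1-based label).

    C      : list of training patterns used as kernel centers
    models : list of (u_k, b_k), one coefficient vector per class
    """
    k_row = [gaussian_kernel(x, c, mu) for c in C]   # computed once, reused for all k
    scores = [abs(sum(kv * u for kv, u in zip(k_row, u_k)) + b_k)
              for u_k, b_k in models]
    return scores.index(max(scores)) + 1

# Toy setup (hypothetical): two centers and two hand-picked coefficient vectors.
C = [[0.0, 0.0], [1.0, 1.0]]
models = [([1.0, 0.0], 0.0), ([0.0, 1.0], 0.0)]

print(kernel_predict([0.0, 0.0], C, models, mu=1.0))
print(kernel_predict([1.0, 1.0], C, models, mu=1.0))
```

Because no normalization by $\sqrt{u_k^T \mathcal{K} u_k}$ is needed, the per-pattern cost is one kernel row plus $K$ inner products, which is the saving the text refers to.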

SOR Algorithm.
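In our HNPSVMs, the QP problems (31) and (36) can be rewritten in the following unified form:

$$\min_{\alpha} \ \frac{1}{2}\alpha^T Q \alpha - e^T \alpha \quad \text{s.t.} \ 0 \le \alpha \le c e, \tag{37}$$

where $Q \in R^{m \times m}$ is positive definite. For example, the above problem becomes problem (36) when $Q = G_k (H_k^T H_k + c^* I)^{-1} G_k^T$ and $c = c_k$.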
The above problem (37) can be solved efficiently by the following successive overrelaxation (SOR) algorithm; see [27].
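(1) Choose the relaxation parameter $t \in (0, 2)$ and an initial point $\alpha^0 \in R^m$; set $i = 0$.
(2) Suppose that $\alpha^i$ has been obtained at the $i$th iteration; compute $\alpha^{i+1}$ according to the following iteration formula:

$$\alpha^{i+1} = \left(\alpha^i - t E^{-1}\left(Q \alpha^i - e + \tilde{L}(\alpha^{i+1} - \alpha^i)\right)\right)_{\#}, \tag{38}$$

where $\tilde{L}$ and $E$ are the strictly lower triangular part and the diagonal part of $Q$, so that $Q = \tilde{L} + E + \tilde{L}^T$, and $(\cdot)_{\#}$ denotes the projection onto the box $0 \le \alpha \le c e$.
(3) Stop if $\|\alpha^{i+1} - \alpha^i\|$ is less than some desired tolerance. Otherwise, replace $\alpha^i$ by $\alpha^{i+1}$ and $i$ by $i + 1$ and go to step (2).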
SOR is an excellent TWSVM solver because it can efficiently process very large datasets that need not reside in memory. Furthermore, it has been proved in [27,28] that this algorithm converges linearly to a solution. It should be pointed out that we employ the Sherman-Morrison-Woodbury formula [29] to invert the matrix $(H_k^T H_k + c^* I)$ and hence need only invert a matrix of lower order than the original one. Further, in practice, if the number of patterns in the $k$th class is large, the rectangular kernel technique [30,31] can be applied to reduce the dimensionality of our nonlinear classifiers.
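A compact way to see how SOR works on a box-constrained QP of the form $\min_\alpha \frac{1}{2}\alpha^T Q \alpha - e^T \alpha$, $0 \le \alpha \le c$, is the componentwise sweep below. It is a minimal pure-Python sketch under the assumption that $Q$ is symmetric with positive diagonal; the function name, default tolerance, and the small test matrix are illustrative and are not taken from [27].

```python
def sor_box_qp(Q, c, t=1.0, tol=1e-8, max_iter=10000):
    """Projected SOR for: min 0.5 a^T Q a - sum(a)  subject to 0 <= a <= c.

    t is the relaxation parameter, expected in (0, 2). Each component is
    updated in place, so later components already see the new values of
    earlier ones (this is the L(a_new - a_old) term of the SOR formula).
    """
    m = len(Q)
    alpha = [0.0] * m
    for _ in range(max_iter):
        change = 0.0
        for j in range(m):
            grad_j = sum(Q[j][p] * alpha[p] for p in range(m)) - 1.0
            new = alpha[j] - t * grad_j / Q[j][j]
            new = min(max(new, 0.0), c)          # projection onto [0, c]
            change += (new - alpha[j]) ** 2
            alpha[j] = new
        if change ** 0.5 < tol:                  # stop when the sweep barely moves
            break
    return alpha

Q = [[2.0, 1.0], [1.0, 2.0]]                     # small positive definite example
print(sor_box_qp(Q, c=1.0, t=1.2))               # interior solution near (1/3, 1/3)
print(sor_box_qp(Q, c=0.2, t=1.2))               # upper bound active in both components
```

Only one row of $Q$ is touched per component update, which is why the method can process data that does not fit in memory when rows are generated on demand.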

Several Other Approaches.
In this section, we briefly give several extensions based on our framework, obtained by selecting different loss functions or replacing the 2-norm.
First, if the square loss function is chosen, that is, $L(1, f(x)) = (1 - f(x))^2$, then we get the following formulation from the framework (17):

$$\min_{w_k, b_k} \ \frac{1}{2}\|B_k w_k + e_{k1} b_k\|^2 + \frac{c^*}{2}\left(\|w_k\|^2 + b_k^2\right) + c_k \|e_{k2} - (A_k w_k + e_{k2} b_k)\|^2, \tag{39}$$

where the matrix $A_k$ comprises the patterns of the $k$th class, the matrix $B_k$ is defined by (12), $c^* > 0$ and $c_k > 0$ are parameters, $e_{k1}$ and $e_{k2}$ are vectors of ones of appropriate dimensions, and $k = 1, 2, \ldots, K$. This is an extension of LS-TWSVM [14].
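With the square loss, the problem becomes an unconstrained strictly convex quadratic, so the hyperplane can be obtained from a single linear system instead of a QP. The sketch below assumes the objective $\frac{1}{2}\|H v\|^2 + \frac{c^*}{2}\|v\|^2 + c_k \|e - G v\|^2$ with $H = [B_k \ e]$, $G = [A_k \ e]$, and $v = [w^T \ b]^T$; setting the gradient to zero gives $(H^T H + c^* I + 2 c_k G^T G) v = 2 c_k G^T e$. This normal-equation form, the conjugate gradient solver, the function names, and the toy data are illustrative assumptions, not the authors' algorithm.

```python
def ls_system(A_k, B_k, c_star, c_k):
    """Build M and rhs of the normal equations (H^T H + c* I + 2 c_k G^T G) v = 2 c_k G^T e."""
    H = [list(x) + [1.0] for x in B_k]
    G = [list(x) + [1.0] for x in A_k]
    d = len(H[0])
    M = [[sum(r[i] * r[j] for r in H) + 2.0 * c_k * sum(r[i] * r[j] for r in G)
          for j in range(d)] for i in range(d)]
    for i in range(d):
        M[i][i] += c_star
    rhs = [2.0 * c_k * sum(r[j] for r in G) for j in range(d)]
    return M, rhs

def conjugate_gradient(M, rhs, tol=1e-18, max_iter=100):
    """Solve M z = rhs for symmetric positive definite M."""
    n = len(rhs)
    z = [0.0] * n
    r = list(rhs)                    # residual rhs - M z for the zero start
    p = list(r)
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if rs < tol:
            break
        Mp = [sum(M[i][j] * p[j] for j in range(n)) for i in range(n)]
        step = rs / sum(pi * mpi for pi, mpi in zip(p, Mp))
        z = [zi + step * pi for zi, pi in zip(z, p)]
        r = [ri - step * mpi for ri, mpi in zip(r, Mp)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return z

# Toy data (hypothetical): class-k patterns A_k and remaining patterns B_k.
A_k = [[2.0, 0.0], [3.0, 1.0]]
B_k = [[0.0, 1.0], [0.0, -1.0], [1.0, 0.0]]
M, rhs = ls_system(A_k, B_k, c_star=0.1, c_k=1.0)
v = conjugate_gradient(M, rhs)
w, b = v[:-1], v[-1]
print(w, b)
```

For $c^* > 0$ the system matrix is positive definite, so the solution is unique and any symmetric solver (Cholesky, conjugate gradient) applies.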
Second, if we replace the 2-norm with the 1-norm in problem (39), then we get the extension of 1-norm LS-TWSVM [17] as follows: where the matrix $A_k$ comprises the patterns of the $k$th class, the matrix $B_k$ is defined by (12), $c^* > 0$ and $c_k > 0$ are parameters, $e_{k1}$ and $e_{k2}$ are vectors of ones of appropriate dimensions, and $k = 1, 2, \ldots, K$.
These approaches share the decision function (18) and can be extended to the nonlinear case. Their solution methods can be constructed from the corresponding binary algorithms.

Numerical Experiments
In this section, we present experimental results of our binary HNPSVM (BHNPSVM) and multiclass HNPSVM (MHNPSVM) on both artificial and benchmark datasets. In the experiments, we focus on the comparison between our methods and some state-of-the-art classification methods, including SVM, GEPSVM, TWSVM, "1-v-1," "1-v-r," and MBSVM. All the classification methods are implemented in the MATLAB 7.0 [32] environment on a PC with an Intel P4 processor (2.9 GHz) and 1 GB RAM. To obtain the fastest training speed, we employ LIBSVM [33] to implement SVM, "1-v-1," and "1-v-r". BHNPSVM, MHNPSVM, TWSVM, and MBSVM are implemented using the SOR technique, and GEPSVM is implemented with simple MATLAB functions such as "eig". The parameters are selected by the standard 10-fold cross-validation technique [34]; for all methods they are chosen from the set $\{2^{-8}, \ldots, 2^{8}\}$.
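The parameter search described above can be set up as in the following sketch, which builds the candidate grid and the index splits for 10-fold cross-validation. It assumes that the set $\{2^{-8}, \ldots, 2^{8}\}$ contains all integer exponents from $-8$ to $8$; the function name, the shuffling seed, and the dataset size are illustrative assumptions, and the experiments in the paper were run in MATLAB rather than Python.

```python
import random

# Candidate values 2^-8, 2^-7, ..., 2^8 (integer exponents are an assumption).
grid = [2.0 ** p for p in range(-8, 9)]

def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = k_fold_indices(155, k=10)        # e.g. a dataset with 155 patterns
for held_out in range(len(folds)):
    test_idx = folds[held_out]
    train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
    # A classifier would be trained on train_idx and scored on test_idx here,
    # once per candidate parameter value in `grid`.

print(len(grid), [len(f) for f in folds])
```

Each pattern is held out exactly once, and the parameter value with the best mean validation accuracy over the ten folds would be selected.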

Toy Examples.
First, we consider a simple two-dimensional "Cross Planes" dataset as Example 1, which was tested in [11,13] to show that nonparallel hyperplanes classifiers can handle cross-planes data much better than parallel ones. We now show that our BHNPSVM can also handle cross-planes data well, owing to our decision function. The "Cross Planes" dataset is generated by perturbing points lying on two intersecting lines. Figures 1(a)-1(d) show the dataset and the linear classifiers obtained by SVM, GEPSVM, TWSVM, and our BHNPSVM. It is easy to see that the result of our BHNPSVM is more reasonable than that of SVM and better than those of GEPSVM and TWSVM. In addition, we list the accuracy and CPU time of these four classifiers in Table 1. From Table 1, we can see that our BHNPSVM obtains the best accuracy, and its computing time is not the slowest.
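A dataset of the "Cross Planes" type can be generated along the following lines. This is only a sketch of the construction described in the text (perturbing points on two intersecting lines); the specific slopes, noise level, sample size, and seed are assumptions, since the exact settings used for Example 1 are not given here.

```python
import random

def cross_planes(n_per_class=50, noise=0.1, seed=0):
    """Sample points near two intersecting lines x2 = x1 and x2 = -x1.

    Returns a list of ([x1, x2], label) pairs with labels 1 and 2.
    """
    rng = random.Random(seed)
    data = []
    for label, slope in ((1, 1.0), (2, -1.0)):
        for _ in range(n_per_class):
            x1 = rng.uniform(-1.0, 1.0)
            x2 = slope * x1 + rng.gauss(0.0, noise)   # perturb the point off the line
            data.append(([x1, x2], label))
    return data

data = cross_planes()
print(len(data), data[0])
```

Because the two classes overlap near the intersection point, no single separating hyperplane fits such data well, whereas one hyperplane per class can follow each line.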
Second, we consider a two-dimensional three-class dataset as Example 2 to show the operating mechanism of our MHNPSVM and other multiclass classifiers. The three-class dataset is generated by perturbing points lying on three intersecting lines. Figures 2(a)-2(d) show the dataset and the linear classifiers obtained by "1-v-1," "1-v-r," MBSVM, and MHNPSVM. It is easy to see that the results of MBSVM and MHNPSVM are more reasonable than those of "1-v-1" and "1-v-r". We also list the accuracy and CPU time of these four classifiers on Example 2 in Table 1. From Table 1, we can see that our MHNPSVM obtains the best accuracy in both examples, indicating that it is suitable for both "Cross Planes" and multiclass problems.

Benchmark Datasets.
To further compare our methods with others, we examine nine binary-class datasets and nine multiclass datasets used in [12,35], taken from the UCI Repository of machine learning databases [36]. Table 2 gives the details of these eighteen datasets.
To compare the behavior of our linear BHNPSVM with SVM, GEPSVM, and TWSVM, the numerical results on the binary-class UCI datasets are summarized in Table 3, which lists the classification accuracy and computation time; the best accuracy is shown in bold. It is easy to see that, on most of these datasets, the accuracy of our linear BHNPSVM is better than that of linear SVM, GEPSVM, and TWSVM. It can also be seen that our BHNPSVM is a little faster than TWSVM and is competitive with SVM (implemented by LIBSVM). We also list the mean accuracy and mean time of these four classifiers. Our BHNPSVM achieves the highest mean accuracy while training faster than TWSVM. Table 4 reports the results of our kernel BHNPSVM, SVM, GEPSVM, and TWSVM on the binary-class UCI datasets. The Gaussian kernel $\mathcal{K}(x, x') = e^{-\mu \|x - x'\|^2}$ is used. The kernel parameter $\mu$ is also searched over the range from $2^{-8}$ to $2^{8}$. The training CPU times of these four classifiers are also listed. The results in Table 4 are similar to those in Table 3 and therefore further confirm the above conclusion.

To compare the behavior of our MHNPSVM with other multiclass classifiers, namely "1-v-1," "1-v-r," and MBSVM, the linear results of numerical experiments on the multiclass UCI datasets are summarized in Table 5, which lists the classification accuracy and computation time.
From Table 5, we can see that the accuracy of linear MHNPSVM is significantly better than that of linear MBSVM on all 9 UCI datasets. MHNPSVM and MBSVM are almost equally fast, because both solve QP problems of the same size using the SOR algorithm. In contrast, the classification accuracy of "1-v-1" and "1-v-r" shows no statistically significant difference from that of MHNPSVM in all cases except the Vowel dataset, and "1-v-1" and "1-v-r" are a bit lower than MHNPSVM and MBSVM in average training time. Thus, the proposed formulation of MHNPSVM allows the classifier to learn better by reducing the generalization error. However, this improved performance is obtained at the cost of more tuning effort, because MHNPSVM requires tuning more parameters than MBSVM. Table 6 shows the results of numerical experiments for nonlinear MHNPSVM, "1-v-1," "1-v-r," and MBSVM, listing the classification accuracy and computation time. The results in Table 6 are similar to those in Table 5: MHNPSVM has better classification accuracy than MBSVM on eight datasets, while MBSVM is better than MHNPSVM on one dataset, and MHNPSVM and MBSVM are much faster than "1-v-1" and "1-v-r", especially as the amount of data increases.

Table 2: Details of the eighteen UCI datasets.

Data            #Ins   #Fea   #class
Hepatitis        155     19        2
Votes            435     16        2
WBPC             198     34        2
Sonar            208     60        2
Heart-statlog    270     13        2
BUPA             345      6        2
Pima-Indian      768      8        2
CMC             1473      9        2
Australian       690     14        2
Iris             150      3        4
Wine             178      3       13
Ecoli            336      8        8
Vowel            528     11       10
Glass            214      6       13
Vehicle          846      4       18
Car             1728      6        4
Segment         2310      7       19
Satimage        4435      6       36

#Ins is the number of training points; #Fea is the number of attributes; #class is the number of classes.

Conclusions
In this paper, a general framework of nonparallel hyperplanes support vector machines, termed NPSVMs, is proposed for binary and multiclass classification. For binary classification, this framework includes TWSVM and many of its variants, for instance, TBSVM, LS-TWSVM, and NPPC, when different loss functions and parameters are selected. For multiclass classification, the framework is not a direct extension of TWSVM and its variants: we switch the roles of the patterns of the $k$th class and the remaining classes. This strategy does not lead to a significant increase in computational complexity as the number of classes increases. Moreover, in the decision function, "min" and the Euclidean distance used in TWSVM are replaced by "max" and the absolute value $|w^T x + b|$, respectively. The absolute value $|w^T x + b|$ is not only simpler but also more consistent with the primal problems. In particular, we discuss the linear and nonlinear cases of the framework with the hinge loss function as an example. Moreover, we give the primal problems of the extensions of several TWSVM variants. Numerical experiments on several artificial and benchmark datasets indicate that our NPSVMs yield generalization performance comparable to that of SVM, GEPSVM, TWSVM, MBSVM, "1-v-1," and "1-v-r". In short, the proposed framework not only includes TWSVM and many of its variants but also extends them to multiclass classification while retaining the merit of TWSVM (learning speed).
In the future, we will extend the idea of nonparallel hyperplanes classifiers to other problems such as ordinal regression, multi-instance classification, and multilabel classification.