New Fuzzy Support Vector Machine for the Class Imbalance Problem in Medical Datasets Classification

In medical datasets classification, the support vector machine (SVM) is considered to be one of the most successful methods. However, most real-world medical datasets contain some outliers/noise, and the data often have class imbalance problems. In this paper, a fuzzy support vector machine (FSVM) for the class imbalance problem (called FSVM-CIP) is presented, which can be seen as a modified class of FSVM obtained by extending manifold regularization and assigning two misclassification costs to the two classes. The proposed FSVM-CIP can be used to handle the class imbalance problem in the presence of outliers/noise and to enhance the locality maximum margin. Five real-world medical datasets, breast, heart, hepatitis, BUPA liver, and pima diabetes, from the UCI medical database are employed to illustrate the method presented in this paper. Experimental results on these datasets show that FSVM-CIP performs better than or comparably to the competing methods.


Introduction
Computer techniques such as machine learning and pattern recognition have been widely adopted by modern medicine. One reason is that an enormous amount of data has to be gathered and analyzed, which is very hard or even impossible without computer techniques. The other reason is that computer techniques have enabled digital analysis for pathological diagnosis and the automatic classification, differentiation, and detection of diseases. In some cases, the early symptoms of a disease are mild and give no obvious pointer to a possible diagnosis; moreover, many symptoms look very similar to each other although they are caused by different diseases. So it may be difficult even for experienced doctors to make a correct diagnosis. Therefore, an automatic classification system can help doctors diagnose accurately, assess disorders remotely, and evaluate the treatment process [1].
In recent years, researchers have proposed many approaches for medical classification, such as neural networks, Bayesian networks, and the support vector machine (SVM). Among them, SVM is considered to be one of the most successful [2]. For example, to improve time and accuracy in differentiating diffuse interstitial lung disease for computer-aided quantification, a hierarchical SVM was introduced, which shows promise for various real-time and online image-based classification applications in clinical fields [3]. SVM has been used as a classifier for liver disorders, with a correct classification rate that compares favorably with other reported results [4]. A two-stage approach has been proposed for medical datasets classification, in which the artificial bee colony algorithm is used for feature selection and SVM is used for classification [5].
The support vector machine (SVM) proposed by Vapnik [6,7] is a novel approach for solving pattern recognition problems. SVM maps the sample points into a high-dimensional feature space to seek an optimal separating hyperplane by maximizing the margin between two classes. In addition, training an SVM is a quadratic programming (QP) problem, which assures that once a solution is obtained it is the unique global solution, and the sparsity of the solution assures better generalization. However, most real-world medical datasets contain some outliers and noisy examples, and the classical SVM is very sensitive to outliers/noise. To solve this problem, the fuzzy support vector machine (FSVM) [8] was proposed, in which each sample is given a fuzzy membership that denotes the attitude of the corresponding point toward one class. The membership represents how important the sample is to the decision surface.
Nevertheless, many medical datasets are composed of "normal" samples with only a small percentage of "abnormal" ones, which leads to the so-called class imbalance problem. FSVM does not take the class distribution into consideration and can be sensitive to the class imbalance problem. As a result, the hyperplane of FSVM can be skewed towards the minority class, and this skewness can degrade the performance of FSVM with respect to the minority class. To tackle this problem, Veropoulos et al. [9] have proposed a method called different error costs (DEC), where the SVM objective function is modified to assign two different misclassification cost values. Note that One-Class Classification [10,11] is sometimes used in novelty detection, and it only uses the normal training data. However, in many real medical datasets, abnormal examples exist, although they are very few. Furthermore, in classification tasks, the scatter matrix can play an important role when incorporated with the local intrinsic geometric structure of the samples [12]. Some methods have recently been proposed to incorporate the structure of the data distribution into SVM. A linear manifold learning method named locality preserving projection (LPP) is proposed in [13,14], which aims at preserving the local manifold structure of the sample space. Although LPP enhances the local data compactness within each manifold, it does not separate manifolds with different class labels.
In this paper, we propose a new FSVM method for the class imbalance problem (FSVM-CIP) which can be used to address both the problem of class imbalance and outliers/noise. FSVM-CIP not only considers the fuzziness of each training sample but also extends manifold regularization and maximizes the localized relative margin. It takes the positive samples and negative samples into consideration with different misclassification costs according to their unbalanced distributions. We systematically evaluated the FSVM-CIP on five real-world medical datasets and compared its performance with four different SVM methods for classification. The results showed that the proposed method can improve the classification accuracy and handle the classification problems with outliers/noise and imbalanced datasets more effectively.
The rest of this paper is organized as follows. Section 2 briefly reviews the related works. Section 3 presents the details of FSVM-CIP in the linear case. Section 4 presents FSVM-CIP in the nonlinear case in detail. The experimental results on five medical datasets are reported in Section 5, and some concluding remarks are given in Section 6.

Fuzzy Support Vector Machines (FSVMs).
In traditional SVM, all the data points are considered with equal importance and assigned the same penalty parameter in the objective function. However, in many real-world classification applications, some sample points, such as outliers or noise, may not be exactly assigned to one of the two classes, and each sample point does not have the same meaning to the decision surface. To solve this problem, the theory of the fuzzy support vector machine was originally proposed in [8]. A fuzzy membership is introduced for each sample point such that different sample points can make different contributions to the construction of the decision surface.
Suppose the training samples are

$$S = \{(x_i, y_i, s_i)\}_{i=1}^{n}, \tag{1}$$

where $x_i \in R^d$ is the $d$-dimensional sample point, $y_i \in \{-1, +1\}$ represents its class label, and $s_i$ ($i = 1, \ldots, n$) is a fuzzy membership which satisfies $\sigma \le s_i \le 1$ with a sufficiently small constant $\sigma > 0$. The quadratic optimization problem for classification is considered as follows:

$$\min_{w, b, \xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} s_i \xi_i \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n, \tag{2}$$

where $w$ is a normal vector of the separating hyperplane, $b$ is a bias term, and $C$ is a parameter which has to be determined beforehand to control the tradeoff between the classification margin and the cost of misclassification error. Since $s_i$ is the attitude of the corresponding point $x_i$ towards one class and the slack variables $\xi_i$ are a measure of error, the term $s_i \xi_i$ can be considered a measure of error with different weights. The bigger $s_i$ is, the more importantly the corresponding point is treated; the smaller $s_i$ is, the less importantly the corresponding point is treated; thus, different input points can make different contributions to the learning of the decision surface. Therefore, FSVM can find a more robust hyperplane by maximizing the margin while allowing some misclassification of less important points. In order to solve the FSVM optimization problem, (2) is transformed into the following dual problem by introducing Lagrangian multipliers $\alpha_i$:

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j \quad \text{s.t.} \quad \sum_{i=1}^{n} y_i \alpha_i = 0, \quad 0 \le \alpha_i \le s_i C, \quad i = 1, \ldots, n. \tag{3}$$

Compared with the standard SVM, the above statement differs only slightly, namely, in the upper bound of the values of $\alpha_i$. By solving the dual problem in (3) for the optimal $\alpha_i$, $w$ and $b$ can be recovered in the same way as in the standard SVM.
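The role of the membership as a per-sample error weight can be illustrated with a small sketch. The snippet below evaluates the FSVM primal objective, one half of the squared norm of the weight vector plus C times the membership-weighted slacks, for a fixed hyperplane, taking each slack as the hinge loss max(0, 1 - y(w.x + b)). The function name `fsvm_objective` and the toy data are illustrative assumptions, not part of the original formulation.

```python
def fsvm_objective(w, b, X, y, s, C):
    """Evaluate 0.5*||w||^2 + C * sum_i s_i * xi_i for a fixed (w, b).

    xi_i is taken as the hinge loss max(0, 1 - y_i * (w.x_i + b)), i.e. the
    smallest slack that satisfies the margin constraint for sample i.
    """
    margin_term = 0.5 * sum(wk * wk for wk in w)
    weighted_error = 0.0
    for xi, yi, si in zip(X, y, s):
        score = sum(wk * xk for wk, xk in zip(w, xi)) + b
        slack = max(0.0, 1.0 - yi * score)
        weighted_error += si * slack
    return margin_term + C * weighted_error


# Toy data (hypothetical): the third point is a mislabeled outlier.
X = [[2.0, 0.0], [-2.0, 0.0], [3.0, 0.0]]
y = [1, -1, -1]
w, b, C = [1.0, 0.0], 0.0, 10.0

equal_weights = fsvm_objective(w, b, X, y, [1.0, 1.0, 1.0], C)
down_weighted = fsvm_objective(w, b, X, y, [1.0, 1.0, 0.1], C)
```

With equal memberships the outlier dominates the objective; giving it a small membership shrinks its contribution, so the same hyperplane is penalized far less for misclassifying it.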

Locality Preserving Projections (LPP).
Locality preserving projection (LPP) [13,14] is a linear dimensionality reduction algorithm by feature extraction or projection. It builds an adjacency graph incorporating neighborhood information of the data set using the Laplacian graph and then computes a transformation matrix which maps the data points into a subspace. This linear transformation optimally preserves local neighborhood information in a certain sense. The representation map generated by this method can be viewed as a linear discrete approximation to a continuous map that naturally arises from the geometry of the manifold.
For a set $X = \{x_i\}$ ($i \in [1, n]$), let $N_k(x_i)$ denote the $k$ nearest neighbors of node $i$, and let $G$ denote the adjacency graph of dataset $X$. Here, the $i$th node corresponds to the data point $x_i$, and nodes $i$ and $j$ are connected by an edge if node $i$ is among the $k$ nearest neighbors of node $j$ or if node $j$ is among the $k$ nearest neighbors of node $i$; that is, $x_i \in N_k(x_j)$ or $x_j \in N_k(x_i)$. The adjacency graph can be weighted as follows:

$$W_{ij} = \begin{cases} \exp(-\|x_i - x_j\|^2 / t), & x_i \in N_k(x_j) \text{ or } x_j \in N_k(x_i), \\ 0, & \text{otherwise}, \end{cases} \tag{4}$$

where $\exp(-\|x_i - x_j\|^2 / t)$ is called the heat kernel function and $t$ is a constant. $\|x_i - x_j\|$ is the Euclidean distance in $R^d$ between point $x_i$ and point $x_j$. LPP tries to find the transformation vector $w \in R^d$ by minimizing the following objective function:

$$\min_{w} \; \frac{1}{2} \sum_{i,j} (w^T x_i - w^T x_j)^2 W_{ij} = w^T X L X^T w, \tag{5}$$

where $X = [x_1, \ldots, x_n]$, $D$ is a diagonal matrix whose entries are the column sums of $W$, $D_{ii} = \sum_j W_{ji}$, and $L = D - W$ is the Laplacian matrix. The transformation vector $w$ in the objective function in (5) is given by the minimum eigenvalue solution to the generalized eigenvalue problem $X L X^T w = \lambda X D X^T w$. LPP preserves the intrinsic geometry and local structure of the data by minimizing the objective function.
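The graph construction described above can be sketched in a few lines. The code below builds the symmetric k-nearest-neighbor graph with heat-kernel weights and then forms the Laplacian as the diagonal matrix of column sums minus the weight matrix. The function names, the toy points, and the values of `k` and `t` are illustrative assumptions; the eigenvalue step of LPP is not included.

```python
import math


def heat_kernel_graph(X, k, t):
    """Symmetric k-nearest-neighbor graph with heat-kernel weights."""
    n = len(X)

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # Indices of the k nearest neighbors of every point (excluding itself).
    neighbors = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: sq_dist(X[i], X[j]))
        neighbors.append(set(order[:k]))

    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Edge if either point is among the other's k nearest neighbors.
            if i != j and (j in neighbors[i] or i in neighbors[j]):
                W[i][j] = math.exp(-sq_dist(X[i], X[j]) / t)
    return W


def graph_laplacian(W):
    """L = D - W, where D holds the column sums of W on its diagonal."""
    n = len(W)
    col_sums = [sum(W[i][j] for i in range(n)) for j in range(n)]
    return [[(col_sums[i] if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]


# Toy points (hypothetical): three close points and one far away.
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]]
W = heat_kernel_graph(X, k=2, t=1.0)
L = graph_laplacian(W)
```

Because the "or" rule makes the graph symmetric, every row of the Laplacian sums to zero, which is the property the quadratic form in the LPP objective relies on.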

FSVM for the Class Imbalance Problem in the Linear Case
In this section, we first define the local within-class preserving scatter matrix in the linear case. Secondly, the optimization problem formulation of FSVM-CIP in the linear case is given. Moreover, the fuzzy membership functions for linear FSVM-CIP are defined. Finally, the algorithm of linear FSVM-CIP is summarized.

The Local within-Class Preserving Scatter Matrix in the Linear Case.
Following the idea of [15], we build the nearest within-class neighbor graph to model the intrinsic geometry and local structure of the data. The graph preserves local neighborhood information in a certain sense, and it can be viewed as a linear discrete approximation to a continuous map that naturally arises from the geometry of the manifold.
Considering the fact that we have a binary classification problem, one class, denoted as $C_1$, contains the sample points $x_i$ with $y_i = 1$, and the other class, denoted as $C_2$, contains the sample points $x_i$ with $y_i = -1$. Set $|C_1| = n_1$ and $|C_2| = n - n_1$, where the total number of sample points is $n$.

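Definition 1.
For each data point $x_i$ in class $C_c$ ($c = 1, 2$), let $N_k^{w}(x_i)$ denote the set of its $k$ nearest within-class neighbors, and put an edge between $x_i$ and each of these neighbors. The corresponding weight matrix $W^{(c)}$ is

$$W^{(c)}_{ij} = \begin{cases} \exp(-\|x_i - x_j\|^2 / t) / d_i, & x_j \in N_k^{w}(x_i), \\ 0, & \text{otherwise}, \end{cases}$$

where $d_i = \sum_{x_j \in N_k^{w}(x_i)} \exp(-\|x_i - x_j\|^2 / t)$ normalizes each weight.
Definition 2. The local within-class preserving scatter matrix $S_{lw}$ is

$$S_{lw} = \sum_{c=1}^{2} X^{(c)} (I^{(c)} - W^{(c)})^T (I^{(c)} - W^{(c)}) X^{(c)T},$$

where $X^{(c)}$ is the matrix whose columns are the sample points of class $C_c$ and $I^{(c)}$ is an $n_c \times n_c$ identity matrix. In this case, the obtained nearest within-class neighbor graph attempts to preserve the local structure of the data set, and $(I^{(c)} - W^{(c)})^T (I^{(c)} - W^{(c)})$ preserves the locality of nearby points with the same class label in the embedding space during the unfolding process of nonlinear structures [15]. In fact, a heavy penalty is applied to the objective function through the weight $W^{(c)}_{ij}$ if the neighboring data $x_i$ and $x_j$ are mapped far apart. Hence, the minimization criterion is an attempt to ensure that the projections $w^T x_i$ and $w^T x_j$ are close to each other whenever $x_i$ and $x_j$ are close.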
It is worthwhile to note that the local within-class preserving scatter matrix $S_{lw}$ is symmetric and positive semidefinite. $S_{lw}$ looks similar to the within-class scatter matrix $S_w$ [16,17] and the Laplacian matrix $L$ in LPP. However, $S_{lw}$ reflects the intrinsic geometry and local structure of the data, whereas $S_w$ only considers the mean value of samples in different classes. $S_{lw}$ carries the class label information and discriminating information, but $L$ only considers the information of the nearest neighbors of each data point in the input space, without considering the class labels.
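A short sketch may help make the symmetry and positive semidefiniteness concrete. It assumes that the scatter matrix has the form of a sum over classes of X (I - W)^T (I - W) X^T, with each W holding heat-kernel weights over the k nearest same-class neighbors, normalized so that each row sums to one; under that assumption the matrix equals a sum of outer products of residuals, each residual being a point minus the weighted mean of its within-class neighbors. The function name, toy data, and parameter values are hypothetical.

```python
import math


def local_within_class_scatter(X, y, k, t):
    """Sum over classes of X_c (I - W_c)^T (I - W_c) X_c^T, built row by row.

    Each W_c is assumed to hold heat-kernel weights over the k nearest
    same-class neighbors, normalized so that every row sums to one.
    """
    d = len(X[0])
    S = [[0.0] * d for _ in range(d)]
    for label in sorted(set(y)):
        pts = [x for x, yi in zip(X, y) if yi == label]
        for i, xi in enumerate(pts):
            # (squared distance, index) of the k nearest same-class points.
            dists = sorted(
                (sum((a - b) ** 2 for a, b in zip(xi, xj)), j)
                for j, xj in enumerate(pts) if j != i
            )[:k]
            heat = [(math.exp(-dist / t), j) for dist, j in dists]
            norm = sum(h for h, _ in heat)
            # Residual of x_i against the weighted mean of its neighbors.
            r = list(xi)
            for h, j in heat:
                for a in range(d):
                    r[a] -= (h / norm) * pts[j][a]
            # Accumulate the outer product r r^T.
            for a in range(d):
                for b in range(d):
                    S[a][b] += r[a] * r[b]
    return S


# Toy data (hypothetical): one class along each coordinate axis.
X = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [0.0, 5.0], [0.0, 6.0], [0.0, 8.0]]
y = [1, 1, 1, -1, -1, -1]
S = local_within_class_scatter(X, y, k=2, t=1.0)
```

Writing the matrix as a sum of outer products makes both properties immediate: each term is symmetric, and the quadratic form of each term is a squared inner product.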

FSVM-CIP in the Linear Case.
To tackle the imbalanced classification problem with noise and outliers, we integrate FSVM, the ideas of imbalanced classification, and the local within-class preserving scatter. On one hand, as shown in Figure 1, the linear classifier is represented by the hyperplane $w^T x + b = 0$; it defines a field for majority-class examples ($w^T x + b > 1 - \xi$) and another field for minority-class examples ($w^T x + b < -(1 + \rho - \xi)$), which is used to weaken the skewness towards the minority class and enhance the locality maximum margin. On the other hand, by assigning a higher misclassification cost to the minority class examples than to the majority class examples, the effect of class imbalance can be reduced. In addition, to minimize the amount of misclassifications, the local within-class preserving scatter matrix $S_{lw}$ is used to preserve the intrinsic geometry and local structure of the data.
Figure 1: The hyperplanes of linear FSVM-CIP.
Due to this, we define the primal problem of FSVM-CIP as a minimization over $w$, $b$, $\rho$, and the slack variables, where $n_1$ and $n_2$ denote the numbers of positive (normal class or majority class) and negative (abnormal class or minority class) training points, and $n_2 = n - n_1$. $\rho$ is a nonnegative number, and $\rho + 1$ is the margin between the hyperplane and the minority class examples. $\eta$ is a nonnegative regularization constant which controls the tradeoff between the local within-class scatter and the margin. The variables $\nu_1$, $\nu_2$ are positive penalty parameters, which tune the penalty cost of the training error for positive and negative training data, respectively. $\xi_i, \xi_j \ge 0$ are the slack variables, and $s_i^+$, $s_j^-$ are the fuzzy memberships of the examples of the two classes.
Obviously, $w^T S_{lw} w$ brings prior geometrical information into the penalty terms based on manifold regularization. Minimizing $w^T S_{lw} w$ means that data that are close and in the same class in the input space are likely to be close in the output space. Therefore, $w^T S_{lw} w$ aims to preserve the local information of the manifold structure.
Note that, in FSVM-CIP, we assign different fuzzy membership values to training examples to reflect their different importance to their own classes. This is similar to assigning different misclassification costs $s_i^+/(\nu_1 n_1)$ ($s_j^-/(\nu_2 n_2)$) to different training examples. In order to reduce the effect of class imbalance, we can assign higher membership values or a lower parameter $\nu_2$ to the minority class examples, while we assign lower membership values or a higher $\nu_1$ to the majority class. That is, our proposed method does not tend to skew the separating hyperplane towards the minority class examples, because the minority class examples are now assigned a higher misclassification cost. By means of setting $s_i^+/(\nu_1 n_1)$ ($s_j^-/(\nu_2 n_2)$) and extending manifold regularization, the learned optimal separating hyperplane enhances the relative maximum margin, and FSVM-CIP is less sensitive to imbalanced class problems.
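As a worked illustration of the cost assignment, the sketch below computes the per-sample misclassification cost as the fuzzy membership divided by the product of the class penalty parameter and the class size. The function name `cost_bounds`, the argument names, and the 8-to-2 toy split are assumptions made for the example.

```python
def cost_bounds(memberships, labels, nu_pos, nu_neg):
    """Per-sample cost s / (nu * n_class) for a two-class problem.

    labels use +1 for the positive (majority) class and -1 for the
    negative (minority) class.
    """
    n_pos = sum(1 for lab in labels if lab == 1)
    n_neg = len(labels) - n_pos
    bounds = []
    for s, lab in zip(memberships, labels):
        if lab == 1:
            bounds.append(s / (nu_pos * n_pos))
        else:
            bounds.append(s / (nu_neg * n_neg))
    return bounds


# Toy setting (hypothetical): 8 majority and 2 minority samples,
# all with membership 1 and the same penalty parameter for both classes.
labels = [1] * 8 + [-1] * 2
memberships = [1.0] * 10
bounds = cost_bounds(memberships, labels, nu_pos=0.5, nu_neg=0.5)
```

Even with identical memberships and penalty parameters, dividing by the class size already gives each minority sample a cost four times that of a majority sample in this toy split; lowering the minority penalty parameter or raising minority memberships widens the gap further.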
Then, we transform this problem into its corresponding dual problem as follows.
Equation (15) is a typical convex quadratic programming problem which is easy to solve numerically. Suppose $\alpha^* = [\alpha_1^*, \ldots, \alpha_n^*]^T$ solves the above optimization problem; then the optimal weight vector is

$$w^* = (I + \eta S_{lw})^{-1} \sum_{i=1}^{n} \alpha_i^* y_i x_i.$$

A training sample $x_i$ ($1 \le i \le n$) is called a support vector (SV) if the corresponding Lagrange multiplier $\alpha_i > 0$. Denote the SV sets as $SV_1 = \{x_i \mid 0 < \alpha_i \le s_i^+/(\nu_1 n_1), 1 \le i \le n_1\}$ and $SV_2 = \{x_j \mid 0 < \alpha_j \le s_j^-/(\nu_2 n_2), n_1 + 1 \le j \le n\}$, while $n_+$ and $n_-$ denote the number of SVs in $SV_1$ and $SV_2$, respectively. According to the KKT conditions, the constraints become equations for the input data in $SV_1$ and $SV_2$, respectively, with the slack variables being 0. Thus, the optimal thresholds $b^*$ and $\rho^*$ can be calculated. However, from the numerical perspective, it is better to take the mean value of $b^*$ and $\rho^*$ resulting from all such data. As a result, the corresponding decision function of the linear FSVM-CIP will be

$$f(x) = \mathrm{sgn}(w^{*T} x + b^*).$$

Note that, to deal with the small sample size problem, $\eta S_{lw}$ is regularized by adding the identity matrix $I$ before any inversion takes place. Hence, $(I + \eta S_{lw})$ is always nonsingular, and the inverse of $(I + \eta S_{lw})$ exists.
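The recovery of the weight vector from the dual solution can be sketched as a single linear solve. The code assumes that the stationarity condition of the Lagrangian gives (I + eta * S) w = sum of alpha_i * y_i * x_i, which is the linear analogue of the kernel expression given later in the paper; the multipliers are taken as given, since the dual solver itself is not shown. All function names and the toy numbers are hypothetical.

```python
def solve_linear_system(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(row) + [rhs[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z


def optimal_weight_vector(alpha, y, X, S, eta):
    """Solve (I + eta * S) w = sum_i alpha_i * y_i * x_i for w."""
    d = len(X[0])
    v = [sum(a * yi * xi[k] for a, yi, xi in zip(alpha, y, X))
         for k in range(d)]
    A = [[(1.0 if i == j else 0.0) + eta * S[i][j] for j in range(d)]
         for i in range(d)]
    return solve_linear_system(A, v)


# Toy numbers (hypothetical); alpha satisfies sum_i alpha_i * y_i = 0.
X = [[2.0, 2.0], [0.0, -2.0]]
y = [1, -1]
alpha = [0.5, 0.5]
S = [[1.0, 0.0], [0.0, 3.0]]

w_regularized = optimal_weight_vector(alpha, y, X, S, eta=1.0)
w_plain = optimal_weight_vector(alpha, y, X, S, eta=0.0)
```

Setting `eta` to zero reduces the solve to the standard SVM expansion of the weight vector, while a positive `eta` shrinks the components of `w` along directions where the local within-class scatter is large.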
Following the terminology in [18], a training sample $x_i$ ($1 \le i \le n$) is called a margin error (ME) if the corresponding slack variable $\xi_i > 0$. We give the following theorem (Theorem 3), which is used for parameter selection later.
The theorem is stated in terms of the mean fuzzy memberships of the MEs in the positive and negative classes and the mean fuzzy memberships of the SVs in the positive and negative classes, respectively.
A proof of the above theorem can be found in the Appendix.

Fuzzy Membership Functions in the Linear Case.
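In FSVM, the fuzzy membership is used to reduce the effects of outliers or noise, and different fuzzy membership functions have different influences on the fuzzy algorithm. Basically, the rule for assigning proper membership values to data points can depend on the relative importance of the data points to their own classes. In this paper, we consider two fuzzy membership functions given in [19]. Given the sequence of training points, denote the means of the positive class and the negative class as $\bar{x}_+$ and $\bar{x}_-$.
Definition 4. The $s_i^{lin}$ is called the linear fuzzy membership, and $s_i^{lin}$ can be defined as

$$s_i^{lin} = \begin{cases} 1 - \dfrac{\|x_i - \bar{x}_+\|}{\max_{j: y_j = +1} \|x_j - \bar{x}_+\| + \Delta}, & y_i = +1, \\[2ex] 1 - \dfrac{\|x_i - \bar{x}_-\|}{\max_{j: y_j = -1} \|x_j - \bar{x}_-\| + \Delta}, & y_i = -1, \end{cases}$$

where $\Delta$ is a small positive value, which is used to avoid $s_i^{lin}$ becoming zero. $\|\cdot\|$ is the Euclidean distance.
Definition 5. The $s_i^{exp}$ is called the exponential fuzzy membership, and $s_i^{exp}$ can be defined as

$$s_i^{exp} = \begin{cases} \dfrac{2}{1 + \exp(\beta \|x_i - \bar{x}_+\|)}, & y_i = +1, \\[2ex] \dfrac{2}{1 + \exp(\beta \|x_i - \bar{x}_-\|)}, & y_i = -1, \end{cases}$$

where the parameter $\beta \in [0, 1]$ determines the steepness of the decay.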
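The two membership functions can be sketched for a single class as follows. The forms used here, one minus the distance to the class mean divided by the maximum distance plus a small constant, and two divided by one plus the exponential of a steepness parameter times the distance, are assumed from the description in the text and the cited source [19]. The helper names and the toy points are illustrative.

```python
import math


def _euclid(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def class_mean(points):
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]


def linear_membership(points, delta=1e-6):
    """1 - d_i / (max_j d_j + delta), with d_i the distance to the class mean."""
    center = class_mean(points)
    dists = [_euclid(p, center) for p in points]
    d_max = max(dists)
    return [1.0 - d / (d_max + delta) for d in dists]


def exponential_membership(points, beta=0.5):
    """2 / (1 + exp(beta * d_i)), with beta in [0, 1] setting the decay."""
    center = class_mean(points)
    return [2.0 / (1.0 + math.exp(beta * _euclid(p, center))) for p in points]


# Toy one-dimensional class (hypothetical); its mean is 2.0.
points = [[0.0], [2.0], [1.0], [5.0]]
s_lin = linear_membership(points)
s_exp = exponential_membership(points)
```

Both functions give membership 1 to a point sitting exactly at the class mean and smaller values to points farther away; the small constant keeps the linear membership of the farthest point strictly positive.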

FSVM for the Class Imbalance Problem in the Nonlinear Case
In this section, we extend the local within-class preserving scatter matrix and FSVM-CIP into feature space. Moreover, the fuzzy membership functions in feature space are defined. Finally, the algorithm of kernel FSVM-CIP is summarized.

Kernel Extension.
In order to handle nonlinear classification, the kernelization trick [20] is used to map the $d$-dimensional data points into an arbitrary reproducing kernel Hilbert space (RKHS) [21] via a mapping function $\phi: R^d \to H$; that is, $x_i \to \phi(x_i)$. Then a linear hyperplane $f(\mathbf{k}) = w^T \phi(\mathbf{k}) + b$ in the feature space $H$ corresponds to a nonlinear hyperplane in the original space $R^d$, where $w, \phi(\mathbf{k}) \in H$, $\mathbf{k} \in R^d$, and $b \in R$.
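The weight matrices $W_\phi^{(1)}$ and $W_\phi^{(2)}$ are the nonlinear versions of $W^{(1)}$ and $W^{(2)}$, respectively. $W_\phi^{(1)}$ and $W_\phi^{(2)}$ can be built from the kernel function $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$, and the nonlinear version of each weight is

$$W^{(c)}_{\phi, ij} = \begin{cases} \exp(-\|\phi(x_i) - \phi(x_j)\|^2 / t) / d_i, & x_j \text{ is a nearest within-class neighbor of } x_i, \\ 0, & \text{otherwise}, \end{cases}$$

with $\|\phi(x_i) - \phi(x_j)\|^2 = K(x_i, x_i) - 2K(x_i, x_j) + K(x_j, x_j)$, where $d_i = \sum_j \exp(-\|\phi(x_i) - \phi(x_j)\|^2 / t)$, summed over the nearest within-class neighbors of $x_i$, is a normalizer. Thus, the kernel FSVM-CIP can be obtained by solving a quadratic problem of the same form as in the linear case, minimized over $w$, $b$, $\rho$, and the slack variables. Like its linear counterpart, the solution to this optimization problem can be found using Lagrange multipliers. By the representer theorem, $w$ can be given by $w = \sum_{i=1}^{n} \beta_i \phi(x_i)$. We obtain the dual form of the optimization problem, in which $M = Y K Q^{-1} K Y$ and

$$Q = K + \eta \sum_{c=1}^{2} K^{(c)} (I^{(c)} - W_\phi^{(c)})^T (I^{(c)} - W_\phi^{(c)}) K^{(c)T}.$$

Here $K$ is the $n \times n$ kernel matrix, $K^{(c)}$ is its block of columns belonging to class $c$, $\alpha = [\alpha_1, \ldots, \alpha_n]^T$, and $Y = \mathrm{diag}(y_1, y_2, \ldots, y_n)$ is a diagonal matrix.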
Equation (27) is a typical convex quadratic programming problem which is easy to solve numerically. Suppose $\alpha^* = [\alpha_1^*, \ldots, \alpha_n^*]^T$ solves the above optimization problem; then the optimal weight vector is $\beta^* = Q^{-1} K Y \alpha^*$. The optimal thresholds $b^*$ and $\rho^*$ are then computed from the support vectors, as in the linear case. Finally, a more robust decision function of kernel FSVM-CIP will be

$$f(x) = \mathrm{sgn}\Big(\sum_{i=1}^{n} \beta_i^* K(x_i, x) + b^*\Big).$$

A proof of Theorem 6 can be found in the Appendix. Next, we consider fuzzy membership functions in feature space.

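Definition 7.
The $s_i^{lin}$ is called the linear fuzzy membership in feature space, and $s_i^{lin}$ can be defined as

$$s_i^{lin} = \begin{cases} 1 - \dfrac{\|\phi(x_i) - \phi(\bar{x}_+)\|}{\max_{j: y_j = +1} \|\phi(x_j) - \phi(\bar{x}_+)\| + \Delta}, & y_i = +1, \\[2ex] 1 - \dfrac{\|\phi(x_i) - \phi(\bar{x}_-)\|}{\max_{j: y_j = -1} \|\phi(x_j) - \phi(\bar{x}_-)\| + \Delta}, & y_i = -1, \end{cases}$$

where $\Delta$ is a small positive value. $\|\cdot\|$ is the Euclidean distance.
Definition 8. The $s_i^{exp}$ is called the exponential fuzzy membership in feature space, and $s_i^{exp}$ can be defined as

$$s_i^{exp} = \begin{cases} \dfrac{2}{1 + \exp(\beta \|\phi(x_i) - \phi(\bar{x}_+)\|)}, & y_i = +1, \\[2ex] \dfrac{2}{1 + \exp(\beta \|\phi(x_i) - \phi(\bar{x}_-)\|)}, & y_i = -1, \end{cases}$$

where the parameter $\beta \in [0, 1]$ determines the steepness of the decay. Consider the class mean in feature space,

$$\phi(\bar{x}_+) = \frac{1}{n_1} \sum_{j=1}^{n_1} \phi(x_j).$$

Thus, the distance $\|\phi(x_i) - \phi(\bar{x}_+)\|$ can be given by

$$\|\phi(x_i) - \phi(\bar{x}_+)\| = \sqrt{K(x_i, x_i) - \frac{2}{n_1} \sum_{j=1}^{n_1} K(x_i, x_j) + \frac{1}{n_1^2} \sum_{j=1}^{n_1} \sum_{l=1}^{n_1} K(x_j, x_l)}.$$

Likewise, $\|\phi(x_i) - \phi(\bar{x}_-)\|$ can be given in a similar manner.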
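The distance to a class mean in feature space only needs kernel evaluations, which the following sketch illustrates. It expands the squared distance between a mapped point and the mean of the mapped class points into three kernel sums; the function names and the toy points are assumptions made for the example, and the Gaussian kernel follows the form used in the experiments.

```python
import math


def rbf_kernel(u, v, sigma=1.0):
    """Gaussian kernel exp(-(u - v)^T (u - v) / sigma)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(u, v)) / sigma)


def feature_space_distance(x, class_points, kernel):
    """Distance between phi(x) and the mean of phi over class_points.

    ||phi(x) - m||^2 = K(x, x) - (2/n) sum_j K(x, x_j)
                       + (1/n^2) sum_j sum_l K(x_j, x_l)
    """
    n = len(class_points)
    k_xx = kernel(x, x)
    cross = sum(kernel(x, p) for p in class_points) / n
    within = sum(kernel(p, q) for p in class_points
                 for q in class_points) / (n * n)
    # Clamp tiny negative values caused by floating-point rounding.
    return math.sqrt(max(k_xx - 2.0 * cross + within, 0.0))


def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))


# Toy check (hypothetical points): with a linear kernel the result must
# equal the ordinary Euclidean distance to the class mean [1, 0].
d_linear = feature_space_distance([1.0, 3.0], [[0.0, 0.0], [2.0, 0.0]],
                                  linear_kernel)
d_rbf_self = feature_space_distance([1.0, 2.0], [[1.0, 2.0]], rbf_kernel)
```

Plugging this distance into either membership function gives the feature-space memberships without ever forming the mapping explicitly.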

Solution.
Based on the above, we can state the approach of kernel FSVM-CIP as Algorithm 2.

Experiments and Discussions
In this section, the proposed FSVM-CIP is compared with other related representative methods: standard FSVM [8], SVDD [11], FSVM for class imbalance learning (FSVM-CIL) [22], and FSVM with minimum within-class scatter (WCS-FSVM) [23]. We implement FSVM-CIP using the linear fuzzy membership and the exponential fuzzy membership, which are denoted as FSVM-CIP$_{lin}$ and FSVM-CIP$_{exp}$, respectively. All the experiments are performed in Matlab (R2010a) on a personal computer with the following configuration: 2.99 GHz CPU, 4.0 GB RAM, and Microsoft Windows XP.

Data Preparation.
In this section, we use five real-world medical datasets from the UCI repository of machine learning databases [24] to demonstrate the classification performance of the method proposed in this paper. These five medical datasets are breast, heart, hepatitis, BUPA liver, and pima diabetes. It is highly likely that these real-world datasets contain some outliers and noisy examples in different amounts [22]. In each of them, the positive class consists of the data corresponding to the healthy, normal, or benign cases, while the negative class contains the data for diseased, abnormal, or malignant cases. Further details of these datasets are provided in Table 1, which lists the total number of positive data #pos, the total number of negative data #neg, the number of positive training examples $n_1$, the number of negative training examples $n_2$, the positive-to-negative imbalance ratio Ratio, and the data dimensionality $d$.

Performance Measure and Experimental Settings.
We used the geometric mean of sensitivity (sensitivity = proportion of the positives correctly recognized), specificity (specificity = proportion of the negatives correctly recognized), and accuracy (accuracy = proportion of correctly classified instances) for the classifier performance evaluation in experiments, as commonly used in medical datasets classification research [7]. As with the existing SVM and FSVM algorithms, the solution is sensitive to the setting of the parameters. To evaluate the performance, a common strategy is to fix a set of candidate parameters first and then use the best mean cross-validation rate over that set to estimate the generalization accuracy. We adopt this strategy in this paper. For the kernel-based methods, we use a Gaussian RBF kernel, that is, $\exp(-(u - v)^T (u - v)/\sigma)$, where $\sigma$ is the spread of the Gaussian kernel, and $\sigma$ is searched in $\{\sigma_0^2/16, \sigma_0^2/8, \sigma_0^2/4, \sigma_0^2/2, \sigma_0^2, 2\sigma_0^2, 4\sigma_0^2, 8\sigma_0^2, 16\sigma_0^2\}$, where $\sigma_0^2$ is the mean norm of the training data.
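The performance measures can be computed directly from the confusion counts, as in the sketch below. Labels follow the convention of the datasets described above, with +1 for the positive class and -1 for the negative class. The geometric mean is taken here as the square root of sensitivity times specificity, which is a common definition and an assumption about the exact quantity reported; the function name and the toy predictions are hypothetical.

```python
import math


def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity, accuracy and their geometric mean."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    sensitivity = tp / (tp + fn)   # positives correctly recognized
    specificity = tn / (tn + fp)   # negatives correctly recognized
    accuracy = (tp + tn) / len(y_true)
    g_mean = math.sqrt(sensitivity * specificity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "g_mean": g_mean}


# Toy predictions (hypothetical): 4 positives and 2 negatives.
y_true = [1, 1, 1, 1, -1, -1]
y_pred = [1, 1, 1, -1, -1, 1]
metrics = classification_metrics(y_true, y_pred)
```

On imbalanced data the accuracy can stay high while the specificity collapses, which is why the per-class rates are reported alongside it.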
For parameter selection, we conduct fivefold cross-validation in a stratified manner so that each validation set has the same positive-to-negative ratio as the training set. Finally, the experiment is repeated 10 times independently for each dataset.
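A stratified split of this kind can be sketched as follows: the indices of each class are shuffled separately and dealt round-robin into the folds, so every fold keeps approximately the class ratio of the full training set. The function name, the seed argument, and the 20-to-10 toy labels are assumptions made for illustration.

```python
import random


def stratified_folds(labels, n_folds=5, seed=0):
    """Split sample indices into folds that preserve the class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for label in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == label]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin into the folds.
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)
    return folds


# Toy labels (hypothetical): 20 positive and 10 negative samples.
labels = [1] * 20 + [-1] * 10
folds = stratified_folds(labels)
```

Each fold serves once as the validation set while the remaining four are used for training; repeating the whole procedure with different seeds gives the independent repetitions.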

Experimental Results.
The test results of the FSVM-CIP method on the breast, heart, hepatitis, BUPA liver, and pima diabetes datasets are given both in the linear case and in the nonlinear case. Tables 2, 3, 4, 5, and 6 display the comparison results with the other methods on these five datasets, respectively.
The main observations from the performance comparisons include the following.
(1) We can see that, in many real-world applications, a linear classifier seems powerless. In terms of accuracy, kernel method can improve the classification performance for all five medical datasets.
(2) We can clearly observe that FSVM-CIP outperforms the other methods on almost all datasets, both in the linear case and in the nonlinear case, giving higher accuracy. This supports the claim that the locality maximum margin and the local structure information captured by the local within-class preserving scatter can improve classification performance; furthermore, assigning different misclassification costs based on the sizes of the two classes is a cost-sensitive learning solution that overcomes the imbalance problem in SVMs.
(3) For all the datasets considered, the classification accuracy given by the FSVM-CIP$_{exp}$ setting is higher than that of the FSVM-CIP$_{lin}$ setting. Therefore, we can state that the FSVM-CIP$_{exp}$ setting with an appropriate selection of the $\beta$ value would be an effective choice for medical datasets. In other words, when dealing with medical datasets classification, the exponential fuzzy membership performs better than the linear fuzzy membership in FSVM-CIP.
(4) For the breast and heart datasets, the class imbalance is not pronounced, and WCS-FSVM outperformed standard FSVM, SVDD, and FSVM-CIL. We can say that the performance can indeed be improved when the structure of the data is taken into consideration. For the other three datasets, where the class imbalance is strikingly more pronounced, the results given by standard FSVM and WCS-FSVM are biased towards the majority class, which shows up as lower specificity and lower accuracy. These results confirm that these two methods are sensitive to the class imbalance problem. Meanwhile, SVDD and FSVM-CIL outperformed standard FSVM and WCS-FSVM. By assigning different misclassification costs to the minority class and the majority class, the effect of class imbalance can be reduced.

Parameter Selection for Kernel FSVM-CIP$_{exp}$.
The parameter $\eta > 0$ is an essential parameter in our proposed method, which controls the tradeoff between the local within-class scatter and the margin. Figure 2 shows the impact of the parameter $\eta$ on the classification accuracy of FSVM-CIP$_{exp}$ in the kernel case, with each value of $\eta$ selected from $\log_2 \eta \in \{-5, -4.5, -4, \ldots, 5.5, 6\}$. It can be seen that the best accuracy for every dataset is obtained within this range; therefore, $\eta$ is searched in a reasonable range.
Compared with standard FSVM, FSVM-CIP employs the additional neighbor parameter $k$. To evaluate the influence of this parameter on the performance, the classification accuracy of kernel FSVM-CIP$_{exp}$ on the five medical datasets is recorded for each value of $k$ in $\{3, 5, 7, 9, 11, 13, 15\}$. Figure 3 shows the results. It can be seen that the classification accuracy is not high when $k$ is small and that the accuracy increases as $k$ increases; however, if $k$ continues to increase, the classification accuracy drops severely. This is because, when $k$ is too small, the neighborhood is too sparse; when $k$ is too large, the number of nearest neighbors is excessive, and preserving so many local relations may be inappropriate.
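The two search ranges can be written out explicitly, as in the sketch below. The first grid holds the values of the regularization constant whose base-2 logarithm runs from -5 to 6 in steps of 0.5, and the second holds the odd neighbor counts from 3 to 15. The joint search helper and its `evaluate` callback, for example a function returning the mean cross-validation accuracy, are hypothetical; the text studies the two parameters separately.

```python
from itertools import product

# Values with log2 in {-5, -4.5, ..., 5.5, 6}: 23 candidates.
eta_grid = [2.0 ** (-5 + 0.5 * i) for i in range(23)]
# Neighbor counts {3, 5, ..., 15}.
k_grid = list(range(3, 16, 2))


def grid_search(evaluate, etas=eta_grid, ks=k_grid):
    """Return (best_score, eta, k) for a user-supplied scoring callback."""
    best = None
    for eta, k in product(etas, ks):
        score = evaluate(eta, k)
        if best is None or score > best[0]:
            best = (score, eta, k)
    return best


# Stand-in scoring function (hypothetical) that peaks at eta = 1, k = 7.
best = grid_search(lambda eta, k: -abs(eta - 1.0) - abs(k - 7))
```

In practice the callback would train the classifier on the training folds with the given parameters and return the validation score.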

Conclusion
Computer tools have improved medical practice to a great extent. Although computer tools cannot replace doctors, they can make their work easier and more effective. In this paper, a new fuzzy support vector machine called FSVM-CIP, used for medical datasets classification, is proposed. The proposed method is based on the local within-class preserving scatter and assigns two misclassification costs in the SVM objective function, which allows learning from imbalanced datasets in the presence of outliers/noise and enhances the locality maximum margin. Experiments were performed on several UCI medical datasets, comparing the proposed method with several related methods such as standard FSVM, SVDD, FSVM-CIL, and WCS-FSVM. The results show that the proposed method compares favorably with the other methods and seems very promising. Finally, we recommend FSVM-CIP$_{exp}$, which uses the exponential fuzzy membership, as an effective choice for medical datasets classification applications. In future work, we intend to extend the investigation to large-scale classification problems.

Appendix
Proof of Theorem 3 in Section 3.2.
Proof. According to the dual form of the optimization problem (15), we can derive (A.1). Likewise, the KKT conditions yield the complementary slackness relations (11) and (12). According to (11), for all samples with $\xi_i > 0$ the Lagrange multiplier associated with the constraint $\xi_i \ge 0$ is 0. In view of (13), this implies that $\alpha_i = s_i^+/(\nu_1 n_1)$ holds for every positive ME. Summing over the positive MEs using (A.1) gives the bound on the positive MEs. Furthermore, in view of (15), each SV in the positive class can contribute at most $1/(\nu_1 n_1)$ to $\sum_{i=1}^{n_1} \alpha_i$, which gives the bound on the positive SVs.
Proof of Theorem 6 in Section 4.1.