PMSVM: An Optimized Support Vector Machine Classification Algorithm Based on PCA and Multilevel Grid Search Methods

We propose an optimized Support Vector Machine classifier, named PMSVM, in which System Normalization, PCA, and Multilevel Grid Search methods are comprehensively considered for data preprocessing and parameter optimization, respectively. The main goals of this study are to improve the classification efficiency and accuracy of SVM. Sensitivity, Specificity, Precision, the ROC curve, and so forth are adopted to appraise the performance of PMSVM. Experimental results show that PMSVM has relatively better accuracy and remarkably higher efficiency compared with traditional SVM algorithms.


Introduction
The swift development of machine learning technologies gives us a good chance to process and analyse data from a brand-new perspective. Machine learning, also known as knowledge discovery, is one of the most important branches of computer science; it aims to find useful patterns in data and is quite different from traditional statistical methods. As a comparatively new machine learning algorithm, the Support Vector Machine (SVM) has attracted much attention recently and has been successfully used in various application domains [1][2][3][4][5][6]. In this study, we focus on constructing an optimized SVM model and applying it to heart disease data classification, aiming to improve the classification efficiency and accuracy of SVM.
Many studies have applied Support Vector Machines to data analysis. Muthu Rama Krishnan et al. designed an SVM-based classifier, which was used on two UCI mammogram datasets for breast cancer detection and reached accuracies of 99.385% and 93.726%, respectively [7]. Xie and Wang integrated a hybrid feature selection method with SVM for erythemato-squamous disease diagnosis, which reached an accuracy of 98.61% [8].
Feature selection is the basis of machine learning algorithms; an appropriate feature selection strategy can obviously improve the performance of machine learning methods. Deisy et al. proposed a novel information-theory-based feature selection algorithm to improve the classification accuracy of SVM classifiers [9][10][11][12]. Other feature selection methods, such as mutual information measurement [13], kernel F-score feature selection, and explicit margin-based feature elimination, are often adopted to obtain better classification results for SVM or other machine learning algorithms [14][15][16][17].
Most machine learning algorithms have parameters, and proper measures should be taken to decide their optimal values. Genetic Algorithms, Particle Swarm Optimization, Artificial Immune System algorithms, and the Grid Search method are the most often used parameter optimization algorithms [18][19][20][21][22][23]. Generally, data feature selection methods and parameter optimization strategies are considered together. Lin et al. developed a Simulated Annealing approach for parameter determination and feature selection in SVM, and experiments showed its good performance [24]. Tan et al. proposed a new hybrid approach in which the Genetic Algorithm and the Support Vector Machine are combined.

Mathematical Derivation of Support Vector Machine
The Support Vector Machine (SVM), one of the most effective machine learning algorithms for classification and regression problems, was first proposed by Vapnik and his colleagues in 1995, and its history can be traced back to the foundational work on Statistical Learning Theory in the 1960s [1][2][3][4]. SVM is good at handling nonlinear, high-dimensional, and small-sample machine learning problems. SVM is built on the VC Dimension (Vapnik-Chervonenkis Dimension) Theory and the Structural Risk Minimization principle, which are the core contents of Statistical Learning Theory [2]. SVM has both a solid theoretical foundation and good generalization ability [6]. Presently, SVM has been used in many domains, such as handwriting recognition, biological character recognition (e.g., face recognition), credit card fraud detection, image segmentation, bioinformatics, function fitting, and medical data analysis [6].
As mentioned above, SVM can be used to solve classification and regression problems; the corresponding models are called SVC and SVR, respectively. In this paper, only SVC is involved, and it is uniformly referred to as "SVM." Intuitively (e.g., in 2-dimensional space), classification problems can be divided into linearly separable tasks (the corresponding data consist of linearly separable samples) and linearly inseparable tasks (the data consist of linearly inseparable samples, also called nonlinearly separable data, or nonlinear data for short). Figure 1 shows these two cases. Such situations can be extended to high-dimensional space.

Linearly Separable Case.
For the sake of simplicity, we only consider two-class classification situations.
Given a dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$. Essentially, dataset $D$ is a set of pairs $(x_i, y_i)$; here $x_i = (x_{i1}, x_{i2}, \ldots, x_{id})$ stands for a data sample, and $y_i = \pm 1$ is the corresponding class label of $x_i$. In particular, we use $x$ to represent an arbitrary data sample and $y$ its corresponding class label. In such situations, SVM searches for the hyperplane with the largest separation "Margin" [1][2][3][4], that is, the Maximum Marginal Hyperplane (MMH). A separating hyperplane can be written as
$$w \cdot x + b = 0,$$
where $w = (w_1, w_2, \ldots, w_d)$ is the weight vector and $b$ is called the bias. Let $g(x) = w \cdot x + b$; the geometrical distance from a sample $x$ to the hyperplane can be expressed as
$$d(x) = \frac{|g(x)|}{\|w\|},$$
where $g(x)$ is referred to as the discriminant function. The purpose of SVM is to find the parameters $w$ and $b$ that maximize the margin of separation $\rho$ defined below. Without loss of generality, the function margin can be fixed to be equal to 1; thus, for a training set, the samples satisfy the constraints
$$y_i (w \cdot x_i + b) \geq 1, \quad i = 1, 2, \ldots, n.$$
The particular samples $(x_s, y_s)$ that satisfy these constraints with equality are referred to as Support Vectors; they are exactly the closest samples to the optimal separating hyperplane. Accordingly, the geometrical distance from a Support Vector $x^{*}$ to the optimal separating hyperplane is
$$d(x^{*}) = \frac{1}{\|w\|},$$
and the margin of separation $\rho$ is therefore
$$\rho = \frac{2}{\|w\|}.$$
Getting the maximum margin hyperplane thus amounts to maximizing $\rho$, that is, minimizing $\frac{1}{2}\|w\|^{2}$ with respect to $w$ and $b$ subject to the constraints above:
$$\min_{w, b} \; \frac{1}{2}\|w\|^{2} \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \geq 1, \; i = 1, 2, \ldots, n.$$
The constrained optimization problem above is known as the Primal Problem. Through the method of Lagrange Multipliers, we construct the Lagrange function
$$L(w, b, \alpha) = \frac{1}{2}\|w\|^{2} - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right],$$
where $\alpha_i \geq 0$ is the Lagrange multiplier associated with the $i$th inequality constraint. Differentiating $L(w, b, \alpha)$ with respect to $w$ and $b$ and setting the results equal to zero yields the conditions of optimality
$$w = \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0.$$
Substituting these conditions back into the Lagrange function gives the corresponding Dual Problem:
$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \quad \text{s.t.} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \; \alpha_i \geq 0.$$
The Karush-Kuhn-Tucker (KKT) complementary condition is
$$\alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right] = 0, \quad i = 1, 2, \ldots, n.$$
As a result, only the support vectors $(x_i, y_i)$ (the closest samples to the optimal separating hyperplane, which determine the maximal margin) correspond to nonzero $\alpha_i$; all the other $\alpha_i$ equal zero. The Dual Problem is a typical convex quadratic programming problem, which in most cases can efficiently converge to the global optimum by adopting appropriate optimization techniques. After determining the optimal Lagrange multipliers $\alpha_i^{*}$, the optimal weight vector is
$$w^{*} = \sum_{i=1}^{n} \alpha_i^{*} y_i x_i,$$
and the corresponding optimal bias can be computed from any support vector $(x_s, y_s)$ as
$$b^{*} = y_s - w^{*} \cdot x_s.$$

2.2.1. Soft Margin SVM.
For linearly inseparable data, slack variables $\xi_i \geq 0$ are introduced, and the Primal Problem becomes
$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^{2} + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \geq 1 - \xi_i, \; \xi_i \geq 0,$$
where $C > 0$ is the penalty parameter. The corresponding Dual Problem (Dual Function) of the soft margin is
$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \quad \text{s.t.} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \; 0 \leq \alpha_i \leq C.$$
The KKT complementary conditions in this inseparable case are
$$\alpha_i \left[ y_i (w \cdot x_i + b) - 1 + \xi_i \right] = 0, \qquad \mu_i \xi_i = 0,$$
where the $\mu_i$ are the Lagrange multipliers introduced to enforce the nonnegativity of $\xi_i$. At the saddle point, the derivative of the Lagrange function of the primal problem with respect to $\xi_i$ is zero, which yields
$$C - \alpha_i - \mu_i = 0.$$
Simultaneously considering the two relations above, samples with $0 < \alpha_i^{*} < C$ have $\xi_i = 0$. Consequently, the optimal weight is
$$w^{*} = \sum_{i=1}^{n} \alpha_i^{*} y_i x_i,$$
and the optimal bias $b^{*}$ can be obtained by taking any sample $(x_i, y_i)$ in the training set for which $0 < \alpha_i^{*} < C$ (so that $\xi_i = 0$) and solving $y_i (w^{*} \cdot x_i + b^{*}) = 1$.

2.2.2. Kernel Trick Based SVM.
For linearly inseparable cases, the kernel trick is another commonly used technique. An appropriate kernel function, based on the inner product between the given samples, defines a nonlinear transformation of the samples from the original space to a feature space of higher (or even infinite) dimension, so as to make the problem linearly separable. That is, a complicated classification problem cast nonlinearly into a high-dimensional space is more likely to be linearly separable than in a low-dimensional space. Concretely, a nonlinear mapping $\varphi(\cdot)$ maps the data $x$ from the original (primal) space into a higher (even infinite) dimensional feature space $H$, such that the mapped data are more likely to be linearly separable. The separating hyperplane is thus extended to
$$w \cdot \varphi(x) + b = 0,$$
and the Primal Problem in this case is
$$\min_{w, b} \; \frac{1}{2}\|w\|^{2} \quad \text{s.t.} \quad y_i (w \cdot \varphi(x_i) + b) \geq 1, \; i = 1, 2, \ldots, n.$$
Using the same mathematical derivation as in the linearly separable case, the corresponding Dual Function is
$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, \varphi(x_i) \cdot \varphi(x_j) \quad \text{s.t.} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \; \alpha_i \geq 0,$$
with the KKT complementary condition
$$\alpha_i \left[ y_i (w \cdot \varphi(x_i) + b) - 1 \right] = 0.$$
After getting the optimal Lagrange multipliers $\alpha_i^{*}$, the optimal weight vector and the corresponding bias are
$$w^{*} = \sum_{i=1}^{n} \alpha_i^{*} y_i \varphi(x_i), \qquad b^{*} = y_s - w^{*} \cdot \varphi(x_s)$$
for any support vector $(x_s, y_s)$. Fortunately, inner products of the form $\varphi(x_i) \cdot \varphi(x_j)$ can be substituted with a kernel function
$$K(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j);$$
thus we only need to calculate inner products in the original space, not in the feature space, which obviously reduces the computational complexity; meanwhile, we are freed from searching for the proper nonlinear mapping function $\varphi(\cdot)$.
In fact, the kernel trick cannot always guarantee that the mapped problem is absolutely linearly separable, so the soft margin SVM and the kernel based SVM are integrated to exploit the advantages of both, and thus linearly inseparable problems can be solved more efficiently. The corresponding Dual form of the constrained optimization problem in the kernel soft margin SVM is
$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{s.t.} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \; 0 \leq \alpha_i \leq C.$$
Accordingly, the optimal classifier is
$$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{n} \alpha_i^{*} y_i K(x_i, x) + b^{*} \right),$$
where $\alpha_i^{*}$ and $b^{*}$ are the optimal Lagrange multipliers and bias. The most frequently used kernel functions are (i) the linear kernel $K(x, y) = x \cdot y$; (ii) the polynomial kernel $K(x, y) = (x \cdot y + c)^{h}$, $c \geq 0$; and (iii) the RBF kernel $K(x, y) = \exp(-\gamma \|x - y\|^{2})$. The third one, the RBF kernel function, is used in our experiments.
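To make the kernel soft-margin formulation above concrete, the following minimal Python sketch trains a C-SVC with the RBF kernel on synthetic data. It uses scikit-learn (whose SVC class wraps LIBSVM) rather than our own implementation, and the dataset and the parameter values C and gamma are illustrative placeholders only.

```python
# Minimal sketch: RBF-kernel C-SVC via scikit-learn (which wraps LIBSVM).
# The synthetic data and parameter values are illustrative placeholders.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(200, 13)                      # 200 samples, 13 attributes
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)  # synthetic +1 / -1 class labels

clf = SVC(C=1.0, kernel="rbf", gamma=0.1)   # soft-margin C-SVC with RBF kernel
clf.fit(X, y)

print("support vectors per class:", clf.n_support_)
print("training accuracy:", clf.score(X, y))
```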

Process of Principal Component Analysis
Principal Component Analysis (PCA), first introduced by Pearson in 1901, is a methodology that can be used to reduce the number of explanatory variables of a dataset. In this paper, PCA is used for dimension reduction.
The main procedure of PCA is introduced as follows.
Given a dataset $D_{n \times m}$, where $x_i = (x_{i1}, x_{i2}, \ldots, x_{im})$, $x_i \in \mathbb{R}^{m}$, $i = 1, 2, \ldots, n$. Essentially, dataset $D$ is an $n \times m$ matrix. First, $D$ should be normalized to obtain a matrix with $m$ normalized $n$-dimensional column vectors, so that attributes with large domains do not dominate those with smaller ones (the following procedures are exerted on these column vectors). Then, $m$ orthonormal vectors are calculated to act as a basis for the normalized input data; they are referred to as the principal components. Let $u_1, u_2, \ldots, u_m$ denote these $m$ mutually perpendicular unit vectors; they should satisfy
$$\mathrm{COV}(u_i, u_j) = 0, \quad i, j \in \{1, 2, \ldots, m\}, \; i \neq j. \quad (38)$$
Third, these principal components are sorted in descending order according to their "significance." Finally, the first several components are chosen to reconstruct a good approximation of the original data. Thus the dimensionality of the data is reduced; that is, PCA can be used to reduce the data dimension.
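As a concrete illustration of the procedure just described (normalize, compute the covariance of the column vectors, sort the eigenvectors by significance, keep the leading components), the following NumPy sketch is a simplified stand-in for our implementation; the z-score normalization and the 0.85 variance threshold are assumptions borrowed from the rest of the paper.

```python
# Minimal PCA sketch: eigen-decomposition of the covariance matrix,
# keeping enough components to explain a chosen fraction of the variance.
import numpy as np

def pca_reduce(D, threshold=0.85):
    # Column-wise standardization so large-domain attributes do not dominate.
    Z = (D - D.mean(axis=0)) / D.std(axis=0)
    cov = np.cov(Z, rowvar=False)                   # m x m covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)          # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]               # sort by "significance"
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()      # cumulative explained variance
    k = int(np.searchsorted(ratio, threshold) + 1)  # first k components reach threshold
    return Z @ eigvecs[:, :k]                       # n x k reduced data

reduced = pca_reduce(np.random.rand(270, 13), threshold=0.85)
print(reduced.shape)
```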

The Proposed Methods
In this paper, we select the Statlog Heart Disease Dataset from the UCI (University of California, Irvine) Machine Learning Repository as our experimental data. The dataset contains 270 tuples; each tuple includes thirteen data attributes and one class attribute. The details of the dataset are described in Table 1.
Because the value ranges of the different attributes vary greatly, the tuples in the Statlog Heart Disease Dataset are preprocessed as follows. First, to facilitate the following experiments, we change the values of the attribute "class" into 1 and −1, respectively; that is, 1 represents "presence" and −1 represents "absence." Then the tuples in the dataset are normalized by column to [−1, 1]. In the normalization process, only the first thirteen data attributes are considered. We call this attribute normalization process System Normalization. Letting $x_{(i,j)}$ represent the $i$th element of the column vector $x_j$, $j = 1, 2, \ldots, m$, the actual normalization procedure for each column is
$$x'_{(i,j)} = \frac{2\left(x_{(i,j)} - \min(x_j)\right)}{\max(x_j) - \min(x_j)} - 1. \quad (39)$$
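A minimal sketch of the System Normalization step, assuming the column-wise min-max form reconstructed in (39) above; only the thirteen data attributes are scaled, and the class column is left untouched.

```python
# Column-wise "System Normalization" of the 13 data attributes to [-1, 1];
# assumes the min-max form of (39); the class attribute is not rescaled.
import numpy as np

def system_normalize(X):
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    return 2.0 * (X - col_min) / (col_max - col_min) - 1.0

data = np.random.rand(270, 14)                  # placeholder: 13 attributes + 1 class column
data[:, :13] = system_normalize(data[:, :13])   # normalize only the attribute columns
```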
To evaluate the performance of an SVM classifier, we often adopt cross validation to estimate the accuracy of the classification model. Traditionally, when using k-fold cross validation, the input dataset is randomly divided into k subsets, so that all subsets have almost the same number of samples. In the training phase of the classification model, each subset acts as the testing subset exactly once and acts as a training subset k − 1 times. In other words, when using k-fold cross validation, k classification models are built, each using k − 1 subsets as the training set and the remaining subset as the test set. The final classification performance is appraised using the average result of the k classification models. In this paper, the stratified sampling technique is adopted to generate the k folds (subsets); thus each subset has the same number of positive samples and the same number of negative samples, and the ratio between positive and negative samples in each subset is the same as that of the whole dataset. That is, the acquired subsets keep the statistical distribution of the original dataset. Algorithm 1 describes the main process of the Stratified Cross Validation.
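The stratified k-fold procedure of Algorithm 1 can be roughly sketched as follows; this uses scikit-learn's StratifiedKFold as a stand-in for our Algorithm 1, with an RBF-kernel C-SVC and illustrative parameter values, and X and y are assumed to be NumPy arrays.

```python
# Sketch of stratified k-fold cross validation: each fold preserves the
# positive/negative ratio of the whole dataset.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def stratified_cv_accuracy(X, y, k=10, C=1.0, gamma=0.1):
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        clf = SVC(C=C, kernel="rbf", gamma=gamma)
        clf.fit(X[train_idx], y[train_idx])                  # k-1 folds for training
        scores.append(clf.score(X[test_idx], y[test_idx]))   # 1 fold for testing
    return float(np.mean(scores))                            # average over the k models
```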

Multilevel Grid Search.
In this paper, we select C-SVC as our SVM classifier and the RBF kernel as the kernel function. Thus we need to decide the values of the penalty parameter C of C-SVC and the parameter γ of the RBF kernel function. Generally, the Grid Search technique is the most frequently used method for deciding the optimal values of these two parameters. Given initial value ranges and search steps for the two parameters, the Grid Search algorithm iteratively checks every value pair of C and γ to obtain the optimal pair. Different settings of the initial value ranges and steps for these two parameters greatly influence the optimization results. Meanwhile, because the Grid Search method is a typical exhaustive search technique, the search process is time consuming.
Here, we present an adapted grid search method named Multilevel Grid Search (MGS), which effectively reduces the search time while obtaining the same search result as the traditional grid search process.
Algorithm 2 reveals the details of our Multilevel Grid Search method.
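Algorithm 2 itself is not reproduced here; the sketch below illustrates one plausible coarse-to-fine reading of a two-level grid search over log2(C) and log2(γ). It is our assumption about the general idea rather than the exact Algorithm 2: a coarse grid first locates a promising region, and a finer grid is then searched around the best coarse point.

```python
# Hypothetical two-level (coarse-to-fine) grid search over log2(C) and log2(gamma).
# A sketch of the general idea only, not the paper's exact Algorithm 2.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def grid_level(X, y, c_range, g_range, step):
    best = (-np.inf, None, None)
    for log_c in np.arange(c_range[0], c_range[1] + step, step):
        for log_g in np.arange(g_range[0], g_range[1] + step, step):
            clf = SVC(C=2.0 ** log_c, kernel="rbf", gamma=2.0 ** log_g)
            acc = cross_val_score(clf, X, y, cv=5).mean()   # cross-validated accuracy
            if acc > best[0]:
                best = (acc, log_c, log_g)
    return best

def multilevel_grid_search(X, y):
    # Level 1: coarse search over a wide range with a large step.
    acc, c1, g1 = grid_level(X, y, (-5, 15), (-15, 3), step=2.0)
    # Level 2: fine search in a small neighborhood of the coarse optimum.
    acc, c2, g2 = grid_level(X, y, (c1 - 2, c1 + 2), (g1 - 2, g1 + 2), step=0.25)
    return 2.0 ** c2, 2.0 ** g2, acc
```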
Figures 2 and 3 show the parameter optimization results of the MGS method. Figures 2(a) and 3(a) show that PCA has little influence on the classification accuracy of SVM; on average, the experimental results in our study show that, when adopting a proper PCA threshold, PCA-based SVM has classification accuracy similar to that of traditional SVM algorithms.

Experimental Results
In this paper, the UCI Statlog Heart Disease dataset is used to test our method. The main purpose of our study is to create a more efficient and usable SVM model. The original dataset is processed based on the holdout method; that is, the given data are partitioned into two independent sets, one as the training set and the other as the test set. Two-thirds of the original data are allocated to the training set; the remaining one-third is allocated to the test set. First, we repeat the stratified-subsampling-based holdout method 10 times to generate 10 stratified datasets S-DATA1, S-DATA2, . . ., S-DATA10; then, the random-subsampling-based holdout method is repeated 10 times to generate R-DATA1, R-DATA2, . . ., R-DATA10. In this paper, the experiments are mainly based on S-DATA1, S-DATA3, . . ., S-DATA9 and R-DATA2, R-DATA4, . . ., R-DATA10.
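The repeated 2/3-1/3 holdout partitioning described above can be sketched as follows; the helper name repeated_holdout and the use of scikit-learn's train_test_split are our own illustrative choices rather than the actual experimental code.

```python
# Sketch: 10 stratified and 10 random 2/3-1/3 holdout splits of the dataset.
from sklearn.model_selection import train_test_split

def repeated_holdout(X, y, n_repeats=10, stratified=True):
    splits = []
    for seed in range(n_repeats):
        strat = y if stratified else None
        splits.append(train_test_split(X, y, test_size=1 / 3,
                                       stratify=strat, random_state=seed))
    return splits  # list of (X_train, X_test, y_train, y_test) tuples

# S_DATA = repeated_holdout(X, y, stratified=True)   # stratified subsampling
# R_DATA = repeated_holdout(X, y, stratified=False)  # random subsampling
```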

SVM is a promising machine learning algorithm with a solid theoretical foundation and good generalization ability. However, the training of even the fastest SVMs can be extremely slow. In this paper, PCA is adopted to reduce the dimension of our data; the PCA threshold is set to 85%.
Figure 5 shows the PCA results on the Statlog Heart Disease Dataset. As shown in Figure 5, the System Normalization process has no obvious effect on PCA. Further study shows that, when adopting a proper threshold (85% in this study), PCA has no direct influence on the classification accuracy of SVM. Figure 6 shows the ROC curves of the SVM classifier on two differently sampled Heart Disease Datasets: one randomly sampled and the other stratified sampled. It is obvious that PCA-based SVM has classification accuracy similar to that of traditional SVM, but it reduces the time complexity to some extent (about a 25% reduction).
We name the classifier based on the PCA and MGS methods PMSVM, and we call the classifier based only on the PCA strategy PLSVM. All experimental results are compared with those of the well-known LIBSVM algorithm. To compare the performances of PMSVM and PLSVM with that of LIBSVM, the Confusion Matrix, Sensitivity, Specificity, Precision, ROC curve, and AUC are used as the main evaluation criteria for classification accuracy; meanwhile, the time overhead is measured to assess the efficiency of our method. All SVM algorithms are based on the C-SVC classification model and the RBF kernel function.
The Confusion Matrix is a useful tool for describing a classifier's results for the different classes. Given $m$ classes, the Confusion Matrix is an $m \times m$ matrix whose element $CM_{i,j}$ gives the number of samples of class $i$ that the classifier assigns to class $j$. For an ideal classifier with 100% classification accuracy, all samples fall on the main diagonal of the Confusion Matrix; that is, letting $\|D\|$ represent the total number of samples of dataset $D$, we have
$$\|D\| = \sum_{i=1}^{m} CM_{i,i}.$$
"Specificity" is the ratio between True Negative and Neg; in other words, it is the fraction of negative samples that are correctly recognized:
$$\text{Specificity} = \frac{\text{True Negative}}{\text{Neg}}.$$
Precision gives the ratio between the number of correctly classified positive samples and the total number of samples classified as positive:
$$\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}.$$
The ROC curve is a useful visual tool for comparing the performances of different classification models. ROC is the abbreviation of Receiver Operating Characteristic. The ROC curve plots the true positive rate against the false positive rate of a given classifier. Here, AUC means the area under the ROC curve, and it expresses the accuracy of a given classification model. Figure 7 demonstrates the ROC curves of PMSVM, PLSVM, and LIBSVM (the parameters of PMSVM and LIBSVM are acquired through Multilevel Grid Search and Grid Search, resp.). Figure 9 demonstrates the time overhead of PMSVM, PLSVM, and LIBSVM.
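The evaluation measures above follow directly from the confusion-matrix counts; the sketch below uses the standard definitions and scikit-learn's roc_auc_score for the AUC, and is only illustrative of how the criteria can be computed.

```python
# Sensitivity, specificity, and precision from 2-class confusion-matrix counts,
# plus ROC AUC from decision scores.
from sklearn.metrics import confusion_matrix, roc_auc_score

def binary_metrics(y_true, y_pred, y_score):
    # labels=[1, -1] puts the positive class first in the matrix rows/columns.
    (tp, fn), (fp, tn) = confusion_matrix(y_true, y_pred, labels=[1, -1])
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)     # fraction of predicted positives that are correct
    auc = roc_auc_score(y_true, y_score)
    return sensitivity, specificity, precision, auc
```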
In the whole experiment, the Stratified Sampling and Random Sampling methods show no direct influence on the classification results of PMSVM, PLSVM, and LIBSVM, but Stratified Cross Validation provides more reliable results. The influence of the stratified sampling technique on the classification of imbalanced data is what we will study further in future work.

Conclusions
In this paper, an optimized Support Vector Machine classifier, named PMSVM, is proposed, in which the System Normalization, PCA, and MGS methods are used to improve the overall performance of SVM. Experimental results show that System Normalization can effectively preserve the classification accuracy (Figure 4). PCA (Principal Component Analysis) plays the role of dimensionality reduction. As shown in our experiments, when a proper threshold value is chosen, PCA can both save classification time and preserve classification accuracy. The most prominent finding of our study is that, when adopting our PCA and Multilevel Grid Search (MGS) methods, the time consumption of our PMSVM algorithm is reduced by about 75% on average, while achieving similar or better classification accuracy than the classical LIBSVM algorithm.

Figure 1 :
Figure 1: Linearly separable data and nonlinearly separable data.
Figure 2: (a) gives the Grid Search results on S-DATA3; (b) demonstrates the Grid Search results on PCA- and System-Normalization-processed S-DATA3; (c)-(d) demonstrate the Multilevel Grid Search process on PCA- and System-Normalization-processed S-DATA3. The situation of Figure 3 is about the same as that of Figure 2, but its experimental data is R-DATA6.

Figure 5 :
Figure 5: PCA results on two different datasets.

Table 1 :
Description of the UCI Statlog Heart Disease Dataset.

Table 2
gives an example for a 2-class classifier, where True Positive represents the number of positive samples that are correctly classified as positive; False Negative represents the number of positive samples that are wrongly classified as negative; True Negative represents the number of negative samples that are correctly classified as negative; and False Positive represents the number of negative samples that are wrongly classified as positive. The following results for Sensitivity, Specificity, Precision, and so forth rely closely on the contents of the Confusion Matrix. Let Pos represent the total number of positive samples and Neg the total number of negative samples; it is easy to get the following relations:
Pos = True Positive + False Negative,
Neg = True Negative + False Positive.

Table 3 :
Comparison of the accuracy and efficiency of algorithms on stratified sampled data.

Table 4 :
Comparison of the accuracy and efficiency of algorithms on random sampled datasets.