A Parallel Genetic Algorithm Based Feature Selection and Parameter Optimization for Support Vector Machine

The extensive applications of support vector machines (SVMs) require an efficient method of constructing an SVM classifier with high classification ability. The performance of an SVM crucially depends on whether the optimal feature subset and SVM parameters can be obtained efficiently. In this paper, a coarse-grained parallel genetic algorithm (CGPGA) is used to simultaneously optimize the feature subset and the parameters of the SVM. The distributed topology and migration policy of CGPGA help find the optimal feature subset and parameters in significantly shorter time, which in turn increases the quality of the solution found. In addition, a new fitness function, which combines the classification accuracy obtained from the bootstrap method, the number of chosen features, and the number of support vectors, is proposed to lead the search of CGPGA in the direction of optimal generalization error. Experimental results on 12 benchmark datasets show that our proposed approach outperforms the genetic algorithm (GA) based method and the grid search method in terms of classification accuracy, number of chosen features, number of support vectors, and running time.


Introduction
The overwhelming amount of data currently available in any field provides great opportunities for researchers to obtain knowledge that was previously impossible to obtain. However, the enormous amount of data also requires the ability to efficiently extract the essential knowledge from existing data and to generalize the obtained knowledge to future unseen data. Support vector machines (SVMs), proposed by Vapnik [1], have become a reference method for many classification problems because of their flexibility, computational efficiency, and capability of handling high dimensional data. Despite the promising results that SVMs have provided, it is still a challenge to efficiently construct an SVM classifier that provides accurate predictions on unseen samples. This so-called generalization ability crucially depends on two tasks, namely, feature selection and parameter optimization [2][3][4].
Feature selection is used to identify the subset of available features that is most essential for classification. Feature selection is important for a variety of reasons, including generalization performance, computational efficiency, feature interpretability, and learning convergence [5][6][7]. Classification problems typically involve a number of features. However, not all of these features are equally important for a specific task. By extracting the essential information from a given dataset while using the smallest number of features, one can save significant computation time and build classifiers that have better generalization ability.
Along with feature selection, parameter optimization is another key factor that affects the generalization ability of SVMs. A proper parameter setting can not only improve the classification ability of a learned SVM model, but also lead to efficient classification of unseen samples. The parameters that need to be optimized include the error penalty parameter $C$ and the kernel function parameter, such as parameter $\gamma$ for the Gaussian kernel function. The performance of an SVM largely depends on the choice of these parameters. Thus, parameter selection is an important research topic in the study of SVMs [8][9][10][11][12][13].
Both the feature selection result and the parameter setting have a significant impact on the accuracy and efficiency of SVMs. Besides, the choice of features and the setting of parameters influence each other, and performing these two tasks independently might result in a loss of classification ability [2,4]. Motivated by these views, the trend in recent years is to turn these two tasks into a multiobjective optimization problem so that global search algorithms, such as genetic algorithm (GA) [2,14,15], particle swarm optimization (PSO) [3], and ant colony optimization (ACO) [4], can be used to perform them jointly. However, jointly performing these two tasks results in a largely expanded solution space, and it requires strong search ability to find the optimal feature subset and parameters for SVMs. Besides, given that training an SVM even once requires a great deal of computation, applying these global search algorithms in practice becomes computationally infeasible as the number of training samples increases.
The aim of this paper is to present an efficient and effective method of constructing an SVM classifier, so that SVMs can be applied to a wider range of practical problems and provide promising results. In this paper, a coarse-grained parallel genetic algorithm (CGPGA) is used to jointly select the feature subset and optimize the parameters for SVMs. The key idea of CGPGA is to divide the whole GA population into several separate subpopulations, each of which searches the whole solution space in parallel. After every certain number of generations, the best individual of each subpopulation migrates to the other subpopulations. The distributed topology and the migration policy can significantly accelerate the process of feature selection and parameter optimization, and thereby increase the classification accuracy of the SVM.
Another key issue addressed in this paper is the design of a proper fitness function that can be used to assess the true generalization ability of a learned SVM and direct the search of CGPGA toward optimal generalization error. An essential part of the model selection process (i.e., choosing one classifier over another) is to evaluate the performance of classifiers and choose the best one. However, the performance estimate derived from the training data is often overoptimistic, due to overspecialization of the learning algorithm to the data [16]. In this paper, a new fitness function, which combines the classification accuracy obtained from the k-fold bootstrap method, the number of chosen features, and the number of support vectors, is proposed to measure the generalization ability of a learned SVM. Experiments on 12 benchmark datasets show that our proposed method not only achieves higher classification accuracy, a smaller feature subset, and a smaller number of support vectors, but also takes significantly shorter processing time.
The remainder of this paper is organized as follows. A brief introduction to the SVM is given in Section 2. Section 3 introduces the basic concepts of parallel genetic algorithms. Section 4 gives a detailed description of our proposed approach. The results of our evaluation are given in Section 5. Section 6 concludes this paper.

Support Vector Machines
2.1. Linear SVM. First, we briefly describe the SVM formulation. SVM is designed for binary classification problems. The training data are given as $(x_i, y_i)$, $i = 1, \ldots, l$, with $x_i \in R^n$ and $y_i \in \{+1, -1\}$, where $R^n$ is the input space, $x_i$ is the sample vector, and $y_i$ is the class label of $x_i$. A hyperplane in the feature space can be described as $w^T x + b = 0$, where $w$ is normal to the hyperplane and $b$ is a scalar. The distance $d(x_i)$ from a point $x_i$ in the feature space to the hyperplane is
$$d(x_i) = \frac{|w^T x_i + b|}{\|w\|}.$$
When the training samples are linearly separable, the SVM finds an optimal separating hyperplane that maximizes the minimum value of $d(x_i)$ by solving the following optimization problem:
$$\min_{w, b} \; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i (w^T x_i + b) \ge 1, \; i = 1, \ldots, l.$$
For the linearly nonseparable case, there is no hyperplane that classifies every training sample correctly. In order to relax the separable case to the nonseparable one, the slack variable $\xi_i$ is introduced into the optimization problem:
$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i \quad \text{subject to} \quad y_i (w^T x_i + b) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, \ldots, l,$$
where parameter $C$ is the tuning parameter used to balance the margin and the training error.
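As a small numeric illustration of the quantities in this formulation, the following sketch evaluates the point-to-hyperplane distance and the soft-margin objective for a fixed hyperplane. It is not a solver: the function names, the toy samples, and the chosen values of `w`, `b`, and `C` are assumptions made for illustration. It uses the standard fact that, for a fixed hyperplane, the smallest feasible slack of sample i is max(0, 1 - y_i (w·x_i + b)).

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def distance_to_hyperplane(w, b, x):
    """Geometric distance |w.x + b| / ||w|| from x to the hyperplane."""
    return abs(dot(w, x) + b) / math.sqrt(dot(w, w))


def soft_margin_objective(w, b, C, X, y):
    """Return 0.5*||w||^2 + C*sum(slack) and the per-sample slacks.

    For a fixed (w, b), the smallest slack satisfying the constraints is
    max(0, 1 - y_i * (w.x_i + b)).
    """
    slacks = [max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)]
    return 0.5 * dot(w, w) + C * sum(slacks), slacks


# Hypothetical toy data: the third sample lies on the wrong side.
X = [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0)]
y = [+1, -1, -1]
w, b = (1.0, 0.0), 0.0

objective, slacks = soft_margin_objective(w, b, C=1.0, X=X, y=y)
```

A larger `C` makes the nonzero slack of the misclassified sample more expensive, which is the margin versus training-error trade-off described above.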
In the classification phase, a sample $x$ in the feature space is assigned a label $y$ according to the following equation:
$$y = \operatorname{sgn}(w^T x + b).$$
Among the variety of kernel functions available, the commonly used kernel functions include the linear kernel:
$$K(x_i, x_j) = x_i^T x_j.$$

Parallel Genetic Algorithms
GAs are stochastic search algorithms based on the principles of natural selection and recombination. They attempt to find the optimal solution to the problem at hand by manipulating a population of candidate solutions. The population is evaluated and the best solutions are selected to reproduce and mate to form the next generation. After a number of generations, good traits dominate the population, resulting in an increase in the quality of the solutions. In most cases, GAs are efficient enough to find acceptable solutions. However, when applied to more complex problems, they run the risk of premature convergence to local optima [17] and of a large increase in the time required to find adequate solutions.
There have been multiple efforts [18,19] to make GAs faster, and one of the most promising choices is to use parallel implementations. The basic idea behind most parallel genetic algorithms (PGAs) is to divide the whole population into several subpopulations and evolve all the subpopulations simultaneously using multiple processors. A PGA basically consists of several GAs, each processing a part of the population or an independent subpopulation, with or without communication between them. Therefore, PGAs can increase the diversity of the population and significantly reduce computation time.
There are three main types of PGAs: (1) master-slave type, (2) fine-grained type, and (3) coarse-grained type [18]. A master-slave PGA acts like a serial GA; the parallelization does not affect the behavior of the algorithm. This model uses a single global population, and the fitness evaluation is distributed among the available processors or cores. Since selection and crossover in this type of PGA consider the entire population, it is also known as a global PGA. In the fine-grained algorithm, the population is divided into a large number of very small subpopulations, which are maintained by different processors. In the ideal case, each processor is allocated only one individual. This method is rarely used, because it requires many processors and incurs a high communication cost in each generation.
The coarse-grained type, also known as the distributed GA or island model, divides the whole population into a few large subpopulations. Genetic operators are carried out within each subpopulation. After several generations, individuals from different subpopulations are exchanged to form new subpopulations for further evolution. This exchange process is called "migration"; it is the essential part of the CGPGA that diversifies the population and prevents premature convergence. The schematic of CGPGA is given in Figure 1.
In this paper, CGPGA is applied to simultaneously select the feature subset and optimize the parameters of the SVM. The whole population is divided into several subpopulations. Each of these subpopulations evolves independently, and different evolutionary strategies are applied to them. After every certain number of generations, the best individual of each subpopulation migrates to every other subpopulation and replaces the worst individual there.
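The island scheme described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the toy `one_max` fitness stands in for the SVM-based fitness, and the island size, the per-island crossover and mutation rates, the tournament selection, and the single-elite rule are all assumptions. Only the migration rule (the best individual of each island is copied into every other island, replacing its worst individual, every fixed number of generations) follows the text.

```python
import random


def one_max(bits):
    """Toy fitness (hypothetical stand-in for the SVM-based fitness)."""
    return sum(bits)


def evolve_one_generation(pop, fitness, crossover_rate, mutation_rate, rng):
    """Binary tournament selection, one-point crossover, bit-flip mutation."""
    next_pop = [max(pop, key=fitness)[:]]  # keep the current best (elitism)
    while len(next_pop) < len(pop):
        p1 = max(rng.sample(pop, 2), key=fitness)
        p2 = max(rng.sample(pop, 2), key=fitness)
        child = p1[:]
        if rng.random() < crossover_rate:
            cut = rng.randrange(1, len(p1))
            child = p1[:cut] + p2[cut:]
        child = [1 - g if rng.random() < mutation_rate else g for g in child]
        next_pop.append(child)
    return next_pop


def migrate(islands, fitness):
    """Copy each island's best into every other island, replacing its worst."""
    bests = [max(island, key=fitness)[:] for island in islands]
    for i, island in enumerate(islands):
        for j, best in enumerate(bests):
            if i != j:
                worst = min(range(len(island)), key=lambda k: fitness(island[k]))
                island[worst] = best[:]


def run_islands(n_islands=2, island_size=20, n_bits=16, generations=30,
                migration_interval=10, rates=((0.9, 0.01), (0.6, 0.05)), seed=0):
    """Evolve several islands, each with its own (crossover, mutation) rates."""
    rng = random.Random(seed)
    islands = [[[rng.randint(0, 1) for _ in range(n_bits)]
                for _ in range(island_size)] for _ in range(n_islands)]
    for generation in range(1, generations + 1):
        for i in range(n_islands):
            cx_rate, mut_rate = rates[i % len(rates)]
            islands[i] = evolve_one_generation(islands[i], one_max,
                                               cx_rate, mut_rate, rng)
        if generation % migration_interval == 0:
            migrate(islands, one_max)
    return islands


islands = run_islands()
```

In this sketch the islands are evolved one after another in a single process; in a parallel setting each island (or each fitness evaluation) would be handled by a separate worker, with communication needed only at migration time.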

Method
The chromosome design, fitness function, and system architecture of the proposed CGPGA are described as follows.

Chromosome Design.
The design of the chromosome is an important step for the proposed CGPGA method. In this step, the Gaussian kernel is chosen as the kernel function of the SVM classifier. Each chromosome comprises three parts: parameter $C$, parameter $\gamma$, and the feature subset. Binary code is used to represent the chromosome. Figure 2 shows the binary chromosome representation of our design.
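In Figure 2, the bit string of each parameter (its genotype) is transformed into its phenotype by
$$p = \min_p + \frac{\max_p - \min_p}{2^n - 1} \times d, \qquad (8)$$
where $p$ is the phenotype (true value) of the bit string, $\min_p$ is the minimum value of the parameter, $\max_p$ is the maximum value of the parameter, $d$ is the decimal value of the bit string, and $n$ is the length of the bit string.
In the coding of the feature subset, "1" indicates that the feature is selected and "0" indicates that the feature is not selected. $n_f$ represents the number of features in the original dataset.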
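The decoding of such a chromosome can be sketched as follows, assuming the usual linear mapping from the integer value d of an n-bit string onto the interval [min, max], that is, p = min + (max - min) / (2^n - 1) × d. The function names, the bit lengths, and the parameter ranges in the example are hypothetical and chosen only for illustration.

```python
def bits_to_int(bits):
    """Decimal value d of a bit string given as a list of 0/1 (MSB first)."""
    value = 0
    for bit in bits:
        value = value * 2 + bit
    return value


def decode_parameter(bits, min_p, max_p):
    """Map an n-bit genotype linearly onto [min_p, max_p]."""
    n = len(bits)
    d = bits_to_int(bits)
    return min_p + (max_p - min_p) / (2 ** n - 1) * d


def decode_chromosome(chromosome, n_c, n_gamma, c_range, gamma_range):
    """Split a chromosome into (C, gamma, indices of the selected features)."""
    c_bits = chromosome[:n_c]
    gamma_bits = chromosome[n_c:n_c + n_gamma]
    feature_mask = chromosome[n_c + n_gamma:]
    c_value = decode_parameter(c_bits, *c_range)
    gamma_value = decode_parameter(gamma_bits, *gamma_range)
    selected = [i for i, bit in enumerate(feature_mask) if bit == 1]
    return c_value, gamma_value, selected


# Hypothetical layout: 4 bits for C, 4 bits for gamma, 3 feature bits.
chromosome = [1, 1, 1, 1] + [0, 0, 0, 0] + [1, 0, 1]
c_value, gamma_value, selected = decode_chromosome(
    chromosome, n_c=4, n_gamma=4, c_range=(0.0, 15.0), gamma_range=(0.0, 1.0))
```

The number of bits per parameter fixes the resolution of the search: with n bits, the interval is sampled at 2^n evenly spaced values.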

Fitness Function.
The fitness function is an essential part of CGPGA. It evaluates the performance of each individual in the population and predicts which one has the best generalization ability. The classification accuracy obtained from the k-fold bootstrap method, the number of selected features, and the number of support vectors in an SVM model are used to construct the fitness function. All these measurements have been shown to be good indicators of generalization ability [5,20,21]. A high fitness value is assigned to an individual with high classification accuracy, a small number of chosen features, and a small number of support vectors. The fitness function is
$$\text{fitness} = w_a \times \text{Accuracy} + w_f \times F + w_{sv} \times v, \qquad (9)$$
with
$$F = 1 - \frac{\sum_{i=1}^{n_f} f_i}{n_f}, \qquad v = 1 - \frac{\sum_{i=1}^{l} s_i}{l},$$
where $w_a$ is the classification accuracy weight, Accuracy is the average prediction accuracy of the 5-time bootstrap, $w_f$ is the weight of the feature score, and $F$ is the score of the chosen feature subset. For $f_i$, "1" indicates that the $i$th feature is selected and "0" indicates that the $i$th feature is not selected. $n_f$ is the number of original features, $w_{sv}$ is the weight of the support vector score and is set to 0.05 in our experiment, and $v$ is the score of the number of support vectors. For $s_i$, "1" indicates that the $i$th sample is a support vector, and $l$ is the number of samples in the training set.
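A minimal sketch of such a weighted fitness is given below. It assumes that the feature score and the support vector score are each defined as one minus the selected fraction, so that fewer features and fewer support vectors give a higher score; this form is an assumption consistent with the description above. The support vector weight 0.05 is the value stated in the text, while the accuracy and feature weights (0.8 and 0.15) are hypothetical placeholders.

```python
def feature_score(feature_mask):
    """Assumed form: 1 - (number of selected features) / (number of features)."""
    return 1.0 - sum(feature_mask) / len(feature_mask)


def support_vector_score(sv_mask):
    """Assumed form: 1 - (number of support vectors) / (training samples)."""
    return 1.0 - sum(sv_mask) / len(sv_mask)


def fitness(accuracy, feature_mask, sv_mask, w_a=0.8, w_f=0.15, w_sv=0.05):
    """Weighted sum of accuracy, feature score, and support vector score.

    w_sv = 0.05 follows the text; w_a and w_f are hypothetical values.
    """
    return (w_a * accuracy
            + w_f * feature_score(feature_mask)
            + w_sv * support_vector_score(sv_mask))


# Same accuracy and support vectors, but the second individual uses fewer features.
dense = fitness(0.9, [1, 1, 1, 1], [1, 0, 0, 0, 0])
sparse = fitness(0.9, [1, 0, 0, 1], [1, 0, 0, 0, 0])
```

With equal accuracy, the individual that selects fewer features receives the higher fitness, which is the tie-breaking behavior the fitness design aims for.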

System Architectures of the Proposed CGPGA-SVM.
System architecture of CGPGA-based feature selection and parameters optimization method is shown in Figure 3. Main steps are described as follows.
(1) Input Dataset. The input dataset includes all the labeled samples. It is randomly split into a training set and a testing set using the bootstrap method. The training set is used to construct the SVM model, while the testing set is used to test the generalization ability of the learned SVM.
(2) Preprocess the Data. Data preprocessing is important for a variety of reasons. It prevents attributes in greater numeric ranges from dominating those in smaller numeric ranges and increases SVM accuracy [2]. Each feature of the dataset can be linearly scaled to the range [0, 1] by
$$V' = \frac{V - \min}{\max - \min},$$
where $V$ is the original value, $V'$ is the scaled value, $\max$ is the upper bound of the feature value, and $\min$ is the lower bound.
(3) Initialize the Population. Generally, the original population is randomly generated. However, in our experience, it is useful to randomly generate the genotype of parameters $C$ and $\gamma$ but select all the features. This makes the first generation of CGPGA run like a grid search procedure.
(4) Decide the Topology and Migrating Strategy. The important characteristics of CGPGA are the use of a few relatively large subpopulations and migration. By dividing the whole population into several separate subpopulations, one can apply different search strategies (i.e., different crossover rates and mutation rates) to different subpopulations. Specifically, the whole population is divided into 2 subpopulations, and each subpopulation has 60 individuals. After every 10 generations, the best individual in each subpopulation is sent to the other subpopulations and replaces the worst individual there. The purpose of this regular communication is to ensure a good mixing of individuals.
(5) Apply Genetic Operation. Genetic operations, such as selection, crossover, and mutation, are applied to generate better solutions. In a CGPGA, however, genetic operations are carried out within each subpopulation, which means that different subpopulations can use different crossover rates and mutation rates.
(6) Get the Parameters. This step converts each parameter from its genotype into its phenotype. The conversion of parameters can be done by (8).
(7) Select Feature Subset. According to the binary code of the feature set in each chromosome, related features are chosen and unrelated features are discarded. After the unrelated features have been discarded from the training dataset and the testing dataset, these datasets can be used to construct the SVM model and test the generalization ability of the learned SVM.
(8) Evaluate the Individuals Using Bootstrap. Each individual in the population corresponds to a certain pair of parameters $(C, \gamma)$ and a certain choice of feature subset. To obtain a reliable performance estimate of this individual, the bootstrap method is used $k$ times. During each phase of the bootstrap, 50% of the samples in the input dataset are randomly chosen as the training set, while the rest of the samples are used as the testing set. The training set is used to construct the SVM model with the chosen parameters and feature subset, and the testing set is used to estimate the classification ability of the learned SVM. The $k$ classification accuracies from the k-time bootstrap can then be averaged to produce a single classification accuracy. After obtaining the classification accuracy, the fitness value can be calculated by (9). Note that the evaluation of an individual is independent of the rest of the population, so there is no need to communicate during this phase. Thus, the evaluation of individuals is parallelized by assigning a fraction of the population to each of the available processors.
(9) Termination Criteria. In our approach, the process stops if the generation number reaches 100 or if the highest fitness value of the whole population has not improved during the last 30 generations.
To evaluate the classification ability of the proposed approach in different classification tasks, 12 real world datasets from the UCI database [23] have been adopted. Their number of classes, number of samples, and number of original features are shown in Table 1.
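Three of the steps listed above, namely the scaling in step (2), the repeated 50/50 evaluation in step (8), and the stopping rule in step (9), can be sketched as follows. This is an illustrative sketch under stated assumptions: `train_and_score` is a hypothetical callback standing in for SVM training and testing, `nearest_mean_score` is a toy one-dimensional classifier used only so that the example runs, and the toy data are invented. The defaults `max_generations=100` and `patience=30` are the values given in step (9).

```python
import random


def min_max_scale(column):
    """Linearly scale one feature column to [0, 1]: (V - min) / (max - min)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant feature: map everything to 0
    return [(v - lo) / (hi - lo) for v in column]


def bootstrap_accuracy(X, y, train_and_score, k=5, seed=0):
    """Average test accuracy over k random 50/50 train/test splits."""
    rng = random.Random(seed)
    n = len(X)
    accuracies = []
    for _ in range(k):
        idx = list(range(n))
        rng.shuffle(idx)
        train, test = idx[:n // 2], idx[n // 2:]
        accuracies.append(train_and_score(
            [X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test]))
    return sum(accuracies) / k


def nearest_mean_score(X_train, y_train, X_test, y_test):
    """Toy 1-D nearest-class-mean classifier (stand-in for an SVM)."""
    means = {}
    for label in set(y_train):
        values = [x for x, t in zip(X_train, y_train) if t == label]
        means[label] = sum(values) / len(values)
    correct = sum(1 for x, t in zip(X_test, y_test)
                  if min(means, key=lambda c: abs(x - means[c])) == t)
    return correct / len(y_test)


def should_stop(best_fitness_history, max_generations=100, patience=30):
    """Stop at the generation limit or after `patience` generations without gain."""
    generation = len(best_fitness_history)
    if generation >= max_generations:
        return True
    if generation > patience:
        recent = max(best_fitness_history[-patience:])
        earlier = max(best_fitness_history[:-patience])
        return recent <= earlier
    return False


# Hypothetical toy data: one feature, two well-separated classes.
X = min_max_scale([0, 1, 2, 3, 10, 11, 12, 13])
y = [-1, -1, -1, -1, 1, 1, 1, 1]
accuracy = bootstrap_accuracy(X, y, nearest_mean_score, k=5)
```

Because each call to `bootstrap_accuracy` depends only on one individual's parameters and feature subset, these calls are the natural unit of work to distribute across processors.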

Experiments
In order to show the effectiveness of our proposed method, we conduct several comparisons between our proposed method and two other methods, namely, the grid search method and GA-SVM [2]. Grid search is a widely used method of parameter optimization. In most cases, grid search can obtain a satisfactory result. GA-SVM, proposed by Huang and Wang [2], is the most widely used feature selection and parameter optimization method for SVMs. It can deal with feature selection and parameter optimization simultaneously and provide promising results.
To guarantee that the result obtained by the proposed method is valid, we adopt k-fold cross validation. The dataset is randomly partitioned into $k$ independent subsets of approximately equal size. The training and testing process is then repeated $k$ times, with each of the $k$ subsets being used exactly once as the testing set, while the remaining $(k - 1)$ subsets are used as the training set. The training set is used as the input dataset of our proposed approach, and the performance of the obtained parameters and feature subset is tested on the testing set. The $k$ results from the folds can then be averaged to produce a single estimation. In our experiment, $k$ is set to 10. The evaluation procedure of our proposed approach using 10-fold cross validation is shown in Figure 4. Take the Australian dataset as an example; the best pairs of parameters $(C, \gamma)$, the classification accuracy, the number of chosen features, and the number of support vectors for each fold obtained by our proposed approach and the grid search method are shown in Table 4.

Classification Results.
Table 2 gives the comparison of our proposed approach CGPGA-SVM and GA-SVM [2]. Tenfold cross validation is used to estimate the classification accuracy of each approach. The obtained classification accuracy is reported in the form "average ± standard deviation." As shown in Table 2, our proposed approach achieves higher classification accuracy on 11 datasets, the exception being "Svmguide1."
Table 3 gives the experiment results of our proposed approach and grid search. Tenfold cross validation is used to estimate the classification accuracy of each approach. The obtained classification accuracy is reported in the form "average ± standard deviation." As we can see, the proposed approach selects a smaller number of features, whereas grid search uses all the original features. Besides, the proposed approach achieves higher classification accuracy. To validate whether this higher classification accuracy actually indicates stronger classification ability, we used the nonparametric Wilcoxon signed-rank test for all 12 datasets. As shown in Table 4, the  with the number of training samples. In fact, the proportion of support vectors in the model produced by CGPGA-SVM was maintained at a low level. This could enable practitioners to apply the SVM in wider fields where classification has to be done at great speed, for example, online applications.
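The 10-fold protocol and the "average ± standard deviation" summary used in this evaluation can be sketched as follows. The function names are hypothetical, and the use of the sample standard deviation (dividing by k - 1) is an assumption, since the text does not say which variant was reported.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    rng = random.Random(seed)
    idx = list(range(n_samples))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # fold sizes differ by at most 1
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def mean_and_std(values):
    """Mean and sample standard deviation (assumed n - 1 denominator)."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return mean, variance ** 0.5


splits = list(k_fold_indices(25, k=10))
mean, std = mean_and_std([1.0, 2.0, 3.0])
```

In the procedure described above, the training indices of each fold would be passed to the optimization method as its input dataset, and the resulting parameters and feature subset would be scored once on the held-out test indices.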

Computational Efficiency.
A serious limitation of global search methods is their high computational complexity. By using a parallelization strategy, our proposed approach can significantly reduce the running time while still finding an adequate solution. We ran our proposed approach and GA-SVM on all 6 datasets and recorded the average time taken by one generation of evolution. To make the comparison fair, both our approach and GA-SVM have 120 individuals in their populations. Table 5 gives the results. As we can see, on a commonly used 4-core CPU, our approach takes significantly shorter time than GA-SVM.
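Because each individual is evaluated independently, the fitness evaluations of one generation can be handed to a pool of workers. The sketch below shows only the pattern, using a thread pool from the Python standard library; the function name `evaluate_population`, the worker count, and the toy evaluation function are assumptions. For CPU-bound work written in pure Python, threads do not give a real speedup, so a process pool or a training library that runs outside the interpreter lock would be needed in practice.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_population(population, evaluate, n_workers=4):
    """Evaluate every individual with a pool of workers, preserving order."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(evaluate, population))


# Hypothetical toy evaluation: count the 1-bits of each individual.
population = [[1, 0, 1], [1, 1, 1], [0, 0, 0]]
scores = evaluate_population(population, sum)
```

No communication between workers is needed during evaluation, so the achievable speedup is bounded mainly by the number of cores and by how evenly the training times are distributed across individuals.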

Limitations and Conclusions
The overwhelming amount of data currently available in any field poses new challenges for machine learning techniques. To extract essential knowledge from these enormous data and generalize the obtained knowledge to future unseen data, two problems must be efficiently addressed for SVMs, namely, feature selection and parameter optimization.
The number of input features in a classifier should be limited without losing its predictive power. With a smaller feature set, the classification decision is more easily explained and can be made in shorter time. Parameter optimization is another important factor that affects the generalization ability of SVMs. With a proper setting of parameters, the classification accuracy on unseen patterns can be improved. This work investigated a hybrid CGPGA-based model that combines the coarse-grained parallel genetic algorithm with support vector machines to maintain the classification accuracy with a small and suitable feature subset. The distributed topology and migration policy of CGPGA enable us to search the solution space with different search strategies in parallel, thereby providing strong search ability and high efficiency.
Experiment results obtained from several real world datasets of the UCI database showed promising performance in terms of 10-fold accuracy, the size of the selected feature subset, the number of support vectors, and training time. They revealed that the proposed approach not only optimized the SVM model parameters, but also correctly obtained the discriminating feature subset in an efficient way. Moreover, the proportion of support vectors in the model produced by our method was maintained at a low level. This could result in faster classification of unseen patterns and extend the applications of SVMs to wider fields where classification has to be done at great speed. Despite these promising results, our proposed work also has its limitations. Training an SVM is a computation-intensive task. Our proposed work does not reduce the number of SVM models constructed during the optimization procedure or the time needed for training one SVM. The acceleration is achieved by assigning multiple SVM learning processes to each of the available processors, so that the SVM learning processes can be done in parallel and the computational resources can be used in a cost-efficient way. When the problem becomes large enough, the computational burden of our proposed approach may become significant.

Figure 1: The schematic of coarse-grained type PGA.

Figure 2: The chromosome comprises three parts: parameter $C$, parameter $\gamma$, and the feature subset.

Figure 3: System architecture of the proposed method.

Figure 4: The procedure of experiment on the benchmark dataset using our proposed approach.
The kernel function $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$ is incorporated to simplify the computation of the inner product value. The kernel function $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$ gives the inner product value of $x_i$ and $x_j$ in the feature space.
$n_C$ is the number of bits representing parameter $C$, and $n_\gamma$ is the number of bits representing parameter $\gamma$. Note that the selection of $n_C$ and $n_\gamma$ depends on the required computational precision. Besides, the binary code representing the genotype of the parameters $(C, \gamma)$ should be transformed into phenotype by (8).

Table 3: Comparisons between CGPGA-SVM and grid search.

Table 4: Experiment results for Australian dataset using our proposed approach and grid search.

Table 5: Comparisons of computational efficiency.