SVM-RFE Based Feature Selection and Taguchi Parameters Optimization for Multiclass SVM Classifier

Support vector machines (SVMs) perform well on classification and prediction tasks and are widely used in disease diagnosis and medical assistance. However, the basic SVM is designed for two-class problems. This study uses SVM recursive feature elimination (SVM-RFE) as the feature selection method to investigate the classification accuracy of multiclass problems on the Dermatology and Zoo databases. The Dermatology dataset contains 33 feature variables, 1 class variable, and 366 testing instances; the Zoo dataset contains 16 feature variables, 1 class variable, and 101 testing instances. The feature variables in the two datasets were sorted in descending order of explanatory power, and different feature sets were selected by SVM-RFE to explore classification accuracy. The Taguchi method was combined with the SVM classifier to optimize the parameters C and γ and thereby increase multiclass classification accuracy. The experimental results show that, after SVM-RFE feature selection and Taguchi parameter optimization, the classification accuracy exceeds 95% for the Dermatology and Zoo databases.


Introduction
The support vector machine (SVM) is one of the important tools of machine learning. It works as follows: a set of labeled data is used to train the algorithm and obtain a classification model, which then predicts the category of new data [1,2]. SVM is widely applied in fields such as disease or medical imaging diagnosis [3][4][5], financial crisis prediction [6], biomedical engineering, and bioinformatics classification [7,8]. Although SVM is an efficient machine learning method, its classification accuracy requires further improvement for high-dimensional data and for datasets with interacting feature variables [9]. For such problems, feature selection can generally be applied to reduce the complexity of the data structure and to identify the important feature variables, which then form a new set of testing instances [10]. Feature selection filters out inappropriate, redundant, and noisy data, which reduces the computational time of classification and improves classification accuracy. Common feature selection methods include backward feature selection (BFS), forward feature selection (FFS), and ranker [11]. Another feature selection method, support vector machine recursive feature elimination (SVM-RFE), filters relevant features and removes relatively insignificant feature variables in order to achieve higher classification performance [12]. The research findings of Harikrishna et al. have shown that, on datasets reduced by SVM-RFE selection, computation is simpler and classification accuracy improves more effectively [13][14][15].
As SVM is fundamentally a two-class classifier [16], many scholars have explored extending SVM to multiclass data [17][18][19]. However, the resulting classification accuracy is not ideal. There are also many studies on choosing kernel parameters for SVM [20][21][22]. Therefore, this study applies SVM-RFE to sort the 33 variables of the Dermatology dataset (6 classes) and the 16 variables of the Zoo dataset (7 classes) by explanatory power in descending order and selects different feature sets before using the Taguchi parameter design to optimize the Multiclass SVM parameters and improve the classification accuracy of the SVM multiclass classifier. This study is organized as follows. Section 2 describes the research data; Section 3 introduces the methods used throughout this paper; Section 4 discusses the experiment and results. Finally, Section 5 presents our conclusions.

Study Population
This study used the Dermatology dataset from the University of California at Irvine (UCI) and the Zoo database from its College of Information Technology and Computers to conduct experimental tests, parameter optimization, and classification accuracy evaluation using the SVM classifier.
In medicine, dermatological diseases are diseases of the skin that can have a serious impact on health. They occur frequently, and there are more than 1000 kinds, such as psoriasis, seborrheic dermatitis, lichen planus, pityriasis, chronic dermatitis, and pityriasis rubra pilaris. The Dermatology dataset was established by Nilsel in 1998 and contains 33 feature variables and 1 class variable (6 classes).
Before feature selection, we coded the feature attributes. The feature attribute coding of the Dermatology and Zoo databases is shown in Tables 2 and 3.

Feature Selection.
Feature selection implies not only cardinality reduction, which means imposing an arbitrary or predefined cutoff on the number of attributes that can be considered when building a model, but also the choice of attributes, meaning that either the analyst or the modeling tool actively selects or discards attributes based on their usefulness for analysis. A feature selection method is a search strategy that selects or removes features of the original feature set to generate candidate subsets and find the optimum feature subset. The subsets selected at each step are compared according to a predefined assessment function. If the subset selected in step $k+1$ is better than the subset selected in step $k$, the subset from step $k+1$ is taken as the current optimum subset.

Linear Support Vector Machine (Linear SVM).
SVM is developed from statistical learning theory and is based on structural risk minimization (SRM). It can be applied to classification and nonlinear regression [6]. Generally speaking, SVM can be divided into linear SVM and nonlinear SVM, described as follows.
(1) Linear SVM. The linear SVM encodes the training data of the two classes with Class 1 as "+1" and Class 2 as "−1", written as $y_i \in \{+1, -1\}$; the hyperplane is represented as follows:
$$w^T x + b = 0, \tag{1}$$
where $w$ denotes the weight vector, $x$ denotes the input data, and $b$ denotes a constant bias (displacement) of the hyperplane. The purpose of the bias is to ensure that the hyperplane is in the correct position after horizontal movement; therefore, $b$ is determined after training $w$. The parameters of the hyperplane are $w$ and $b$. When SVM is applied to classification, the hyperplane is regarded as a decision function:
$$f(x) = \operatorname{sign}(w^T x + b). \tag{2}$$
Generally speaking, the purpose of SVM is to obtain the hyperplane with the maximum margin and thereby improve the separation between the two categories of the dataset. Optimizing the hyperplane can be regarded as a quadratic programming problem:
$$\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{subject to } y_i(w^T x_i + b) \ge 1,\ i = 1, \dots, n. \tag{3}$$
The original minimization problem is converted into a maximization problem by using Lagrange theory:
$$\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i \alpha_j y_i y_j\, x_i^T x_j \quad \text{subject to } \sum_{i=1}^{n}\alpha_i y_i = 0,\ \alpha_i \ge 0,\ i = 1, \dots, n. \tag{4}$$
Finally, the linear decision function is
$$f(x) = \operatorname{sign}\left(\sum_{i=1}^{n}\alpha_i^{*} y_i\, x_i^T x + b^{*}\right). \tag{5}$$
If $f(x) > 0$, the sample is in the same category as the samples marked with "+1"; otherwise, it is in the category of the samples marked with "−1." When the training data include noise, the linear hyperplane cannot accurately separate the data points. By introducing slack variables $\xi_i$ in the constraint, the original problem (3) can be modified into the following:
$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i \quad \text{subject to } y_i(w^T x_i + b) \ge 1 - \xi_i,\ \xi_i \ge 0,\ i = 1, \dots, n, \tag{6}$$
where $\xi_i$ is the distance between the boundary and the classification point, and the penalty parameter $C$ represents the cost of classification errors on the training data during learning, as determined by the user. When $C$ is greater, the margin will be smaller, indicating that the fault tolerance will be smaller when a fault occurs. Conversely, when $C$ is smaller, the fault tolerance will be greater. When $C \to \infty$, the linearly inseparable problem degenerates into a linearly separable problem.
In this case, the above optimization problem can be solved with Lagrange multipliers to obtain the parameters and the optimum of the objective function; the dual optimization problem for the linearly inseparable case is as follows:
$$\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i \alpha_j y_i y_j\, x_i^T x_j \quad \text{subject to } \sum_{i=1}^{n}\alpha_i y_i = 0,\ 0 \le \alpha_i \le C,\ i = 1, \dots, n. \tag{7}$$
Finally, the linear decision function is
$$f(x) = \operatorname{sign}\left(\sum_{i=1}^{n}\alpha_i^{*} y_i\, x_i^T x + b^{*}\right). \tag{8}$$
(2) Nonlinear Support Vector Machine (Nonlinear SVM). When the input training samples cannot be separated by a linear SVM, a mapping function $\Phi$ can be used to convert the original low-dimensional (for example, 2-dimensional) data into a new high-dimensional feature space in which the problem becomes linearly separable. SVM can efficiently perform nonlinear classification using the so-called kernel trick, which implicitly maps the inputs into high-dimensional feature spaces. Many different kernel functions have been proposed. Choosing a kernel function suited to the data features can effectively improve the computational efficiency of SVM. The most common kernel functions are the following four types:
(1) linear kernel function:
$$K(x_i, x_j) = x_i^T x_j; \tag{9}$$
(2) polynomial kernel function:
$$K(x_i, x_j) = (\gamma\, x_i^T x_j + r)^d,\ \gamma > 0; \tag{10}$$
(3) radial basis kernel function:
$$K(x_i, x_j) = \exp\left(-\gamma \|x_i - x_j\|^2\right),\ \gamma > 0; \tag{11}$$
(4) sigmoid kernel function:
$$K(x_i, x_j) = \tanh(\gamma\, x_i^T x_j + r), \tag{12}$$
where the radial basis kernel function is more frequently applied to high-dimensional and nonlinear problems, and the parameters to be set are $C$ and $\gamma$, which can slightly reduce SVM complexity and improve calculation efficiency; therefore, this study selects the radial basis kernel function.
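As an illustration of the four kernel functions and of how a kernel replaces the inner product in the decision function, the following is a minimal Python sketch. The function names, the default values of `gamma`, `r`, and `d`, and the toy support vectors are assumptions made for this example only; they are not taken from the study.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(u, v):
    return dot(u, v)

def polynomial_kernel(u, v, gamma=1.0, r=0.0, d=3):
    return (gamma * dot(u, v) + r) ** d

def rbf_kernel(u, v, gamma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

def sigmoid_kernel(u, v, gamma=1.0, r=0.0):
    return math.tanh(gamma * dot(u, v) + r)

def decision(x, support_vectors, labels, alphas, b, kernel):
    """Return +1 or -1 from sign(sum_i alpha_i * y_i * K(x_i, x) + b)."""
    score = sum(a * y * kernel(sv, x)
                for sv, y, a in zip(support_vectors, labels, alphas)) + b
    return 1 if score > 0 else -1

# Toy usage with hypothetical support vectors and multipliers.
svs, ys, alphas = [[1.0], [-1.0]], [1, -1], [1.0, 1.0]
pred_pos = decision([2.0], svs, ys, alphas, 0.0, linear_kernel)
pred_neg = decision([-0.5], svs, ys, alphas, 0.0, linear_kernel)
```

Passing `rbf_kernel` instead of `linear_kernel` to `decision` gives the nonlinear classifier without ever computing the mapping $\Phi$ explicitly.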

Support Vector Machine Recursive Feature Elimination (SVM-RFE).
A feature selection process can be used to remove terms in the training dataset that are statistically uncorrelated with the class labels, thus improving both efficiency and accuracy. Pal and Maiti (2010) provided a supervised dimensionality reduction method. The feature selection problem has been modeled as a mixed 0-1 integer program [23]. The multiclass Mahalanobis-Taguchi system (MMTS) was developed for simultaneous multiclass classification and feature selection; the important features are identified using orthogonal arrays and the signal-to-noise ratio and are then used to construct a reduced model measurement scale [24]. SVM-RFE is an SVM-based feature selection algorithm proposed by Guyon et al. [12], who used it to select key feature sets. In addition to reducing classification computational time, it can improve the classification accuracy rate [12]. In recent years, many scholars have improved classification performance in medical diagnosis by taking advantage of this method [22,25].

Multiclass SVM Classifier.
The basic SVM classification principle is designed for two categories. At present, there are three main methods for handling multiclass problems: one-against-all, one-against-one, and directed acyclic graph [26], described as follows.
(1) One-Against-All (OAA). Proposed by Bottou et al. (1994), the one-against-all (one-versus-rest) method converts a classification problem with $k$ categories into $k$ dual-category problems [27]. Scholars have also proposed subsequent effective classification methods [28]. The training process must train $k$ dual-category SVMs. When training the $i$th classifier, data in the $i$th category are regarded as "+1" and the data of the remaining categories are regarded as "−1"; during the testing process, each testing instance is tested by all $k$ trained dual-category SVMs.
The classification result is determined by comparing the outputs of the $k$ SVMs. For a sample $x$ of unknown category, the decision function $\arg\max_{i=1,\dots,k}\left((w^i)^T \Phi(x) + b^i\right)$ generates the decision values, and the predicted category of $x$ is the one with the maximum decision value.
(2) One-Against-One (OAO). When there are $k$ categories, each pair of categories produces one SVM; thus, $k(k-1)/2$ classifiers are produced, and the category of a sample is determined by a voting strategy [28]. For example, if there are three categories (1, 2, and 3) and a sample to be classified with an assumed category of 2, the sample is input into three SVMs. Each SVM determines the category of the sample using the decision function $\operatorname{sign}\left((w^{ij})^T \Phi(x) + b^{ij}\right)$ and adds 1 to the votes of that category. Finally, the category with the most votes is the category of the sample.
(3) Directed Acyclic Graph (DAG). Similar to the OAO method, DAG decomposes a $k$-category classification problem into $k(k-1)/2$ dual-category classification problems [18]. During the training process, it selects any two of the $k$ categories as a group and trains a dual-category SVM for each group; during the testing process, it establishes a dual-category directed acyclic graph, and data of an unknown category are tested starting from the root node. In a problem with $k$ classes, a rooted binary DAG has $k$ leaves labeled by the classes, where each of the $k(k-1)/2$ internal nodes is labeled with an element of a Boolean function [19].
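The OAA and OAO decision rules can be sketched in a few lines of Python. This is only an illustration of the argmax and voting logic: the binary "classifiers" below are hypothetical distance-based stand-ins on a one-dimensional toy problem, not trained SVMs, and the names `oaa_predict` and `oao_predict` are introduced here for the example.

```python
from collections import Counter
from itertools import combinations

def oaa_predict(x, scorers):
    """scorers maps class i to a function returning its decision value;
    the predicted class is the one with the largest value."""
    return max(scorers, key=lambda c: scorers[c](x))

def oao_predict(x, pairwise):
    """pairwise maps a class pair (i, j) to a function that is positive
    when x looks like class i and non-positive when it looks like class j."""
    votes = Counter()
    for (i, j), f in pairwise.items():
        votes[i if f(x) > 0 else j] += 1
    return votes.most_common(1)[0][0]

# Hypothetical 1-D problem: three classes centred at 0, 5, and 10.
centres = {1: 0.0, 2: 5.0, 3: 10.0}
scorers = {c: (lambda x, m=m: -abs(x - m)) for c, m in centres.items()}
pairs = list(combinations(centres, 2))          # k(k-1)/2 = 3 pairs
pairwise = {
    (i, j): (lambda x, mi=centres[i], mj=centres[j]: abs(x - mj) - abs(x - mi))
    for i, j in pairs
}

oaa_label = oaa_predict(4.6, scorers)
oao_label = oao_predict(4.6, pairwise)
```

For the sample 4.6, the pair (1, 2) votes for class 2, the pair (1, 3) votes for class 1, and the pair (2, 3) votes for class 2, so class 2 wins the OAO vote with two votes.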

Feature Selection Based on SVM-RFE.
The main purpose of SVM-RFE is to compute ranking weights for all features and to sort the features according to the weight vector, which serves as the classification basis. SVM-RFE is an iterative process of backward feature removal. Its steps for feature set selection are as follows.
(1) Use the current dataset to train the classifier.
(2) Compute the ranking weights for all features.
(3) Delete the feature with the smallest weight.
The iteration is repeated until only one feature remains in the dataset; the result is a list of features ordered by weight. The algorithm removes the feature with the smallest ranking weight while retaining the feature variables of significant impact. Finally, the feature variables are listed in descending order of explanatory power. SVM-RFE's selection of feature sets can be divided into three steps: (1) the input of the datasets to be classified, (2) the calculation of the weight of each feature, and (3) the deletion of the feature of minimum weight to obtain the ranking of features. The computational steps are as follows [12].
(1) Input: the training sample matrix $X_0$ with its class labels $y$; initialize the list of remaining features $s = [1, 2, \dots, d]$, where $d$ is the number of features, and the feature sorted list $r = [\,]$.
(2) Feature Sorting: repeat the following process until $s = [\,]$.
Obtain the new training sample matrix according to the remaining features: $X = X_0(:, s)$.
Train the SVM classifier on $X$ and compute the ranking criterion $c_i = (w_i)^2$ of each remaining feature $i$ from the weight vector $w$.
Find the feature of minimum weight: $f = \arg\min(c)$.
Update the feature sorted list, $r = [s(f), r]$, and remove the feature $s(f)$ from $s$.
(3) Output: the feature sorted list $r$.
In each loop, the feature with the minimum $(w_i)^2$ is removed. The SVM is then retrained on the remaining features to obtain the new feature sorting. SVM-RFE repeats this process until it obtains the complete feature sorted list. By training the SVM on the feature subsets of the sorted list and evaluating the subsets by SVM prediction accuracy, we can obtain the optimum feature subset.
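The elimination loop can be sketched as follows for the two-class, linear-kernel case. The helper `train_linear_svm` is an assumption of this example: it is a simple batch subgradient-descent trainer for the soft-margin objective, standing in for whichever SVM solver is actually used, and its learning rate and epoch count are arbitrary. Only the loop in `svm_rfe` mirrors the steps described above.

```python
def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Stand-in trainer (assumption): batch subgradient descent on
    0.5 * ||w||^2 + C * sum(max(0, 1 - y_i * (w . x_i + b)))."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)              # gradient of the 0.5 * ||w||^2 term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:            # hinge loss is active for this sample
                for j in range(n_features):
                    grad_w[j] -= C * yi * xi[j]
                grad_b -= C * yi
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def svm_rfe(X0, y, **train_kwargs):
    """Return feature indices sorted from most to least important."""
    s = list(range(len(X0[0])))       # remaining feature indices
    r = []                            # feature sorted list
    while s:
        X = [[row[j] for j in s] for row in X0]   # X = X0(:, s)
        w, _ = train_linear_svm(X, y, **train_kwargs)
        c = [wj ** 2 for wj in w]                  # ranking criterion
        f = c.index(min(c))                        # feature of minimum weight
        r.insert(0, s[f])                          # r = [s(f), r]
        del s[f]                                   # remove it from s
    return r

# Toy data: feature 0 is strongly informative, feature 2 weakly, feature 1 not at all.
X0 = [[2.0, 0.0, 0.1], [-2.0, 0.0, -0.1], [2.0, 0.0, 0.1], [-2.0, 0.0, -0.1]]
y = [1, -1, 1, -1]
ranking = svm_rfe(X0, y)
```

On this toy input the constant feature is eliminated first and the strongly informative feature last, so `ranking` lists feature 0, then 2, then 1. A multiclass setting would need a multiclass weight criterion, which this sketch does not cover.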

SVM Parameters Optimization Based on Taguchi Method.
The Taguchi method originates from an engineering perspective, and its major tools include the orthogonal array and the signal-to-noise (S/N) ratio, where the S/N ratio and the loss function are closely related: a higher S/N ratio indicates fewer losses [29]. Parameter selection is an important step in constructing a classification model with SVM, because differences in parameter settings can affect the stability and accuracy of the model. Hsu and Yu (2012) combined the Taguchi method and the Staelin method to optimize an SVM-based e-mail spam filtering model and improve spam filtering accuracy [30]. Taguchi parameter design has many advantages. First, robustness has a great effect on quality: robustness reduces variation in parts by reducing the effects of uncontrollable variation, and more consistent parts mean better quality. Second, the Taguchi method allows the analysis of many different parameters without a prohibitively large number of experiments. It provides the design engineer with a systematic and efficient method for determining near-optimum design parameters for performance and cost. Therefore, this study uses the Taguchi quality parameter design to optimize the parameters $C$ and $\gamma$ and thereby enhance the accuracy of the SVM classifier in the diagnosis of multiclass diseases.
This study uses the multiclass classification accuracy as the quality attribute of the Taguchi parameter design [21]. In general, a higher classification accuracy means a better classification model; that is, the quality attribute is larger-the-better (LTB), and the LTB S/N ratio is defined as
$$SN_{\mathrm{LTB}} = -10 \log_{10}\left[\frac{1}{n}\sum_{i=1}^{n}\frac{1}{y_i^2}\right], \tag{13}$$
where $y_i$ is the classification accuracy observed in the $i$th repetition of an experimental combination and $n$ is the number of repetitions.
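The larger-the-better S/N ratio is straightforward to compute. The sketch below assumes the standard LTB form, $-10\log_{10}$ of the mean of $1/y_i^2$; the function name `sn_ratio_ltb` and the sample accuracies are hypothetical, and whether accuracies are expressed as fractions or percentages only shifts every ratio by the same constant.

```python
import math

def sn_ratio_ltb(observations):
    """Larger-the-better S/N ratio: -10 * log10(mean(1 / y_i^2))."""
    n = len(observations)
    mean_inv_sq = sum(1.0 / (y * y) for y in observations) / n
    return -10.0 * math.log10(mean_inv_sq)

# Hypothetical accuracies from five repetitions of two parameter settings.
sn_high = sn_ratio_ltb([0.95, 0.96, 0.94, 0.95, 0.97])
sn_low = sn_ratio_ltb([0.90, 0.88, 0.91, 0.89, 0.90])
```

Because the ratio penalizes small observations most heavily, a setting with consistently high accuracy receives a larger S/N ratio than one with lower or more variable accuracy.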

Evaluation of Classification Accuracy.
Cross-validation divides all the samples into a training set and a testing set. The training set is the data from which the algorithm learns the classification rules; the testing set is used to measure the performance of those rules. All the samples are randomly divided into $k$ folds by category, and the folds are mutually exclusive. Each fold is used once as the testing set while the remaining $k-1$ folds are used as the training set. The step is repeated $k$ times, and each testing set validates the classification rules learnt from the corresponding training set to obtain an accuracy rate. The average of the accuracy rates over all testing sets is used as the final evaluation result. The method is known as $k$-fold cross-validation.
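A minimal sketch of this procedure is given below. It assumes that "divided by category" means stratified splitting, so each class is shuffled separately and dealt round-robin into the folds; the names `stratified_k_fold`, `cross_validate`, and the callback `fit_and_score` are hypothetical and introduced only for this example.

```python
import random
from collections import defaultdict

def stratified_k_fold(labels, k, seed=0):
    """Split sample indices into k mutually exclusive folds, class by class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)     # deal indices round-robin
    return folds

def cross_validate(labels, k, fit_and_score, seed=0):
    """Average the score returned by fit_and_score(train_idx, test_idx)
    over the k train/test splits."""
    folds = stratified_k_fold(labels, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        scores.append(fit_and_score(train_idx, test_idx))
    return sum(scores) / k

# Toy usage: 6 samples of class 0 and 4 of class 1, split into 2 folds.
labels = [0] * 6 + [1] * 4
folds = stratified_k_fold(labels, 2)
```

In practice `fit_and_score` would train the classifier on `train_idx` and return its accuracy on `test_idx`.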
As shown in Table 4, regarding parameter $C$, when $C = 10$ and $\gamma = \{5, 10, 12\}$, the accuracy of the experiment is higher than that of the experimental combination of $C = 1$ and $\gamma = \{5, 10, 12\}$; moreover, regarding parameter $\gamma$, the experimental accuracy rate in the case of $\gamma = 5$ and $C = \{1, 10, 50, 100\}$ is higher than that of the experimental combination of $\gamma = 0.1$ and $C = \{1, 10, 50, 100\}$. The near-optimal value of $C$ or $\gamma$ may not be the same for different databases. Finding appropriate parameter settings is important for classifier performance. In practice, it is impossible to simulate every possible combination of parameter settings, which is why the Taguchi methodology is applied to reduce the number of experimental combinations for SVM. The experimental procedure in this study first referred to a related study, for example, one using the parameter values [1, 3, 10, 30, 100] [31], and then set a possible range for both databases ($C$ = 1∼100, $\gamma$ = 1∼12). After that, we slightly adjusted the ranges to see whether the Taguchi quality engineering parameter optimization would give better results for each database. According to our experimental results, the final ranges of $C$ and $\gamma$ are 10∼100 and 2.4∼10, respectively, for the Dermatology database, and 5∼50 and 0.08∼11, respectively, for the Zoo database. Within these ranges, we select three parameter levels and two control factors, $A$ and $B$, to represent parameters $C$ and $\gamma$, respectively. The Taguchi orthogonal array experiment uses $L_9(3^2)$, and the factor level configuration is illustrated in Table 5.
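With only two three-level factors, the nine runs of the $L_9$ array coincide with the full 3 × 3 grid of level combinations, which can be enumerated as in the sketch below. The level values passed in are placeholders chosen inside the Dermatology ranges quoted above; the actual levels are those of Table 5, which is not reproduced here, and the function name is hypothetical.

```python
from itertools import product

def full_factorial_two_factors(levels_a, levels_b):
    """List every run of a two-factor design with its level numbers
    (A, B) and the corresponding parameter values (C, gamma)."""
    return [
        {"run": run, "A": a_idx + 1, "B": b_idx + 1, "C": c_val, "gamma": g_val}
        for run, ((a_idx, c_val), (b_idx, g_val)) in enumerate(
            product(enumerate(levels_a), enumerate(levels_b)), start=1)
    ]

# Placeholder levels (assumption): three values of C and three of gamma.
runs = full_factorial_two_factors([10, 50, 100], [2.4, 5, 10])
for r in runs:
    print(r["run"], "A%d B%d" % (r["A"], r["B"]), "C =", r["C"], "gamma =", r["gamma"])
```

Each run is then evaluated several times with the SVM classifier, and the repeated accuracies of a run feed the S/N ratio of that run.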
After data preprocessing, the Dermatology and Zoo databases include 358 and 101 testing instances, respectively. Each experiment of the orthogonal array is repeated five times ($n = 5$); the experimental combinations and observations are summarized in Tables 6 and 7. According to (13), the S/N ratio of Taguchi experimental combination #1 is calculated as $SN_{\mathrm{LTB}} = -10\log_{10}\left[\frac{1}{5}\sum_{i=1}^{5}\frac{1}{y_i^2}\right]$, where $y_1, \dots, y_5$ are the five accuracy observations of that combination. The S/N ratios of the remaining eight experimental combinations are calculated in the same way and summarized in Table 6. The Zoo experimental results and S/N ratio calculations are shown in Table 7. From these results, we then calculate the average S/N ratios of the various factor levels. With the experiment of Table 8 as an example, the average S/N ratio $\bar{A}_1$ of Factor $A$ at Level 1 is the mean of the S/N ratios of the experimental combinations in which $A$ is set to Level 1. Similarly, we can calculate the average effects $\bar{A}_2$ and $\bar{A}_3$ from Table 6. The difference analysis results of the factor levels for the Dermatology and Zoo databases are shown in Table 8, and the factor effect diagrams are shown in Figures 2 and 3. As a greater S/N ratio represents better quality, according to the factor level differences and the factor effect diagrams, the Dermatology parameter level combination is $A_1B_3$, that is, $C = 10$ and $\gamma = 10$; the Zoo parameter level combination is $A_1B_2$, that is, $C = 5$ and $\gamma = 4$.
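The factor-level analysis amounts to averaging the S/N ratios of the runs that share a level and then picking, for each factor, the level with the largest average. The sketch below shows this on purely illustrative S/N values (1.0 to 9.0, not the values of Tables 6 to 8); the function names are hypothetical.

```python
def level_means(runs, sn_ratios, factor):
    """Average S/N ratio of each level of the given factor ('A' or 'B')."""
    totals, counts = {}, {}
    for run, sn in zip(runs, sn_ratios):
        level = run[factor]
        totals[level] = totals.get(level, 0.0) + sn
        counts[level] = counts.get(level, 0) + 1
    return {level: totals[level] / counts[level] for level in totals}

def best_level(means):
    """Level with the largest average S/N ratio (larger is better)."""
    return max(means, key=means.get)

def effect_range(means):
    """Difference between the best and worst level averages."""
    return max(means.values()) - min(means.values())

# Nine runs in standard order: (A1,B1), (A1,B2), (A1,B3), (A2,B1), ...
runs = [{"A": a, "B": b} for a in (1, 2, 3) for b in (1, 2, 3)]
sn_ratios = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]  # illustrative only

means_a = level_means(runs, sn_ratios, "A")
means_b = level_means(runs, sn_ratios, "B")
chosen = (best_level(means_a), best_level(means_b))
```

With these made-up numbers, Factor A has the larger effect range, so it would be read as the more influential factor in the corresponding factor effect diagram.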
When constructing the Multiclass SVM model using SVM-RFE, three different feature sets are selected according to their significance. At the first stage, Taguchi quality engineering is applied to select the optimum values of parameters $C$ and $\gamma$. At the second stage, the Multiclass SVM classifier is constructed and its classification performance is compared under the above parameters. In the Dermatology experiment, Table 9 illustrates the two feature subsets containing 23 and 33 feature variables. The 33 feature sets are tested by SVM and by SVM based on Taguchi. The parameter settings and testing accuracy rates are shown in Table 9. The experimental results in Figure 4 show that the testing accuracy rate of SVM ($C = 10$, $\gamma = 10$) on the 17-feature dataset can be higher than 90%, which is better than the accuracy rate of SVM ($C = 10$, $\gamma = 11$) on the 20-feature dataset, which is up to 90%. Moreover, regardless of how many feature variables are selected, the accuracy of SVM ($C = 50$, $\gamma = 2.4$) cannot be higher than 90%.
Regarding the Zoo experiment, Table 10 summarizes the experimental test results of sets containing 6, 12, and 16 feature variables using SVM and SVM based on Taguchi. As shown in Table 10, the classification accuracy rate of the set of 12 feature variables using SVM-RFE-Taguchi ($C = 10$, $\gamma = 10$) is the highest, up to 97% ± 0.0396. As shown in Figure 5, the classification accuracy rate of the dataset containing 7 feature variables by SVM-RFE-Taguchi ($C = 50$, $\gamma = 2.4$) can be higher than 90%, which gives relatively better prediction.

Conclusions
As the impact of feature selection on multiclass classification accuracy attracts increasing attention, this study applies SVM-RFE and SVM to construct a multiclass classification model. As RFE is a wrapper-type feature selection method, it requires a previously defined classifier as the assessment rule for feature selection; therefore, SVM is used as the RFE assessment standard to help RFE select feature sets. According to the experimental results of this study, parameter selection has a large impact on the construction of the SVM classification model. Therefore, this study applies the Taguchi parameter design to determine the parameter ranges and select the optimum parameter combination for the SVM classifier, as it is a key factor influencing classification accuracy. This study also collected the experimental results of different research methods on the Dermatology and Zoo databases [16,32,33], as shown in Table 11. By comparison, the proposed method achieves higher classification accuracy.