Feature Selection and Parameter Optimization of Support Vector Machines Based on Modified Artificial Fish Swarm Algorithms

Rapid advances in information and communication technology have made ubiquitous computing and the Internet of Things popular and practicable. These applications create enormous volumes of data, which are available for analysis and classification as an aid to decision-making. Among the classification methods used to deal with big data, feature selection has proven particularly effective. One common approach involves searching for the subset of features that is the most relevant to the task or represents the most accurate description of the dataset. Unfortunately, this kind of subset search is a combinatorial problem that can be very time consuming. Metaheuristic algorithms are commonly used to facilitate the selection of features. The artificial fish swarm algorithm (AFSA) employs the intelligence underlying fish swarming behavior to solve combinatorial optimization problems. AFSA has proven highly successful in a diversity of applications; however, shortcomings remain, such as the likelihood of falling into a local optimum and a lack of diversity. This study proposes a modified AFSA (MAFSA) to improve feature selection and parameter optimization for support vector machine (SVM) classifiers. Experiment results demonstrate that, compared to the original AFSA, MAFSA achieves superior classification accuracy using subsets with fewer features on the given UCI datasets.


Introduction
Advances in information and communications technology have led to a rapid increase in the data processing and computing power of handheld devices. This has made it possible to obtain information anytime and anywhere, ushering in the era of ubiquitous computing (ubicomp) and the Internet of Things (IoT). Applications such as photo sharing and social networking create enormous volumes of digital data, which are available to aid in decision-making. Feature selection is particularly effective in the classification of big data in fields such as data mining, pattern recognition [1], bioinformatics [2], arrhythmia classification [3], and numerous others.
Classification begins with data whose labels are already known, which is divided into training data and testing data. After a classification model is built using the training data, the testing data is used to evaluate the model according to its classification accuracy. SVM is based on the statistical learning theory proposed by Vapnik and Chervonenkis [12] and the principle of structural risk minimization. SVM is commonly used to solve classification or regression problems by finding the optimal hyperplane. SVM provides excellent classification accuracy using a small training set and is easy to implement. This study adopted SVM as the classifier in conjunction with the wrapper method of feature selection.
Feature selection is used to filter out large amounts of unnecessary data, much of which is irrelevant, redundant, and/or noisy. Irrelevant features are unrelated to a given goal, and redundant features represent the same information despite containing different types of data. Noisy features contain wrong or missing data. Sifting through this unnecessary data incurs enormous computing costs and can even skew the results. Feature selection identifies the features that are essential to a given task.
Two methods are commonly used for feature selection: filter and wrapper. The filter approach assigns a weight to every feature, based on a measure such as distance or dependency, and then combines the features with the highest weights to obtain an optimal subset. The wrapper approach pairs a metaheuristic algorithm with a classifier: it assembles candidate feature subsets by eliminating or combining features and calculates a fitness value for each subset on the basis of classification accuracy. The filter approach tends to be quicker but suffers from lower classification accuracy. The need to train the classifier extends the processing time of the wrapper approach; however, its classification accuracy is far higher.
One common wrapper method involves compiling a subset of optimal features with the highest relevance and the most accurate description of the characteristics of the dataset. This can be very time consuming, due to the fact that any increase in the number of features exponentially expands the number of combinations in the feature subset, a phenomenon known as the curse of dimensionality [13]. This study used a metaheuristic algorithm to obtain good-enough or near-optimal feature subsets within a reasonable amount of time. By removing much of the unnecessary data, feature selection can enhance classification accuracy and reduce processing time.
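The exponential growth described above is easy to quantify: a dataset with n features admits 2^n candidate subsets, since each feature is either kept or dropped. A minimal illustration:

```python
# Exhaustive feature-subset search grows as 2^n: each feature is either
# kept or dropped, so the number of candidate subsets doubles per feature.
def subset_count(n_features: int) -> int:
    return 2 ** n_features

for n in (10, 20, 30):
    print(n, subset_count(n))
```

Even a modest 30-feature dataset yields over a billion subsets, which is why metaheuristic search is used instead of enumeration.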
Metaheuristic algorithms are widely used to solve optimization problems such as schedule management [14,15], function optimization [16], and intrusion detection [17]. Metaheuristic algorithms combine random search with empirical rules, and many of these methods were inspired by mechanisms found in nature, such as genetic algorithms (GA) [18-20], based on gene mutation and mating, and particle swarm optimization (PSO) [21], based on the movements of flocks of birds. In 2002, the artificial fish swarm algorithm (AFSA) [22] was proposed to solve optimization problems by simulating the movement of schools of fish and the intelligence underlying these behaviors. Numerous studies have demonstrated the efficacy of AFSA [14,17,23]; however, a number of shortcomings must still be addressed. In [24], various defects were pointed out, such as the likelihood of falling into a local optimum and a lack of diversity.
This study proposes a feature selection model combining a modified AFSA (MAFSA) with SVM. MAFSA simulates the mechanism underlying the endocrine system to create a different search space for every individual fish, thereby enhancing the efficiency with which optimal solutions are derived.
Section 2 introduces SVM and AFSA. Section 3 outlines the proposed MAFSA-based method. Section 4 describes our experiments and results, and in Section 5 we draw conclusions and describe future work.


Background
2.1. Support Vector Machine. Since its introduction by Vapnik in 1995 [5], SVM has become a very popular classifier due to its ease of use and high classification accuracy, even with small training sets. This method of supervised learning is based on the Vapnik-Chervonenkis (VC) dimension and structural risk minimization theory [12,25]. The VC dimension is defined for function sets that take only two values: 0 and 1.
The VC dimension h of a function set is the maximum number of samples that the set can shatter, that is, separate correctly under every possible assignment of the two labels, regardless of how the samples are positioned. As shown in (1), the upper bound of the generalization error R is the sum of the training error R_emp and the confidence interval Φ:

R ≤ R_emp + Φ(h/n), (1)

where R represents the generalization error (also called the testing error), R_emp represents the training error, and Φ represents the confidence interval (CI), which depends on the VC dimension h and the number of samples n. An increase in n leads to a reduction in the CI, whereas an increase in the VC dimension h leads to an increase in the CI.
When the VC dimension increases, the difference between testing error and training error also grows. Thus, reducing the complexity of the classification model and alleviating testing error require that we minimize the training error as well as the VC dimension.
The total risk of the samples is influenced by the VC dimension: an increase in empirical risk reduces the CI, and conversely, a reduction in empirical risk increases the CI, so neither term alone determines the total risk. Thus, we must consider empirical risk as well as the confidence interval, and the tradeoff between them, in order to minimize the total risk. Figure 1 [12] illustrates the connection between the CI and empirical risk.
The primary task in SVM is finding the optimal hyperplane and using it for the classification of data. As shown in Figure 2, the optimal hyperplane has the maximal margin between the two classes. In Figure 2, the round and square points on the margin lines are the support vectors. This illustrates why the optimal hyperplane (with maximal margin) is able to achieve the highest classification accuracy.
The optimal hyperplane is able to separate only linearly separable data; however, real-world classification problems are rarely this simple. Thus, we need a kernel function Φ to map the data into a higher-dimensional space. Three common kernel functions can be used for different situations: radial basis functions (RBFs), polynomials, and sigmoids, as shown in formulas (2), (3), and (4), respectively. Of these three, the RBF kernel provides the best performance and versatility; therefore, we adopted the RBF as the SVM kernel function.
The RBF kernel is

K(x_i, x_j) = exp(-γ ||x_i - x_j||^2). (2)

The polynomial kernel is

K(x_i, x_j) = (γ x_i^T x_j + r)^d. (3)

The sigmoid kernel is

K(x_i, x_j) = tanh(γ x_i^T x_j + r). (4)

In [25], the authors introduced the soft margin method to handle mislabeled examples located within the margin. The soft margin introduces slack variables, which measure the degree to which each sample is misclassified, and the penalty parameter C is used to weight the misclassification degree. It was shown in [26] that setting the right values of the penalty parameter C and the RBF kernel parameter γ for SVM can greatly enhance the effectiveness of classification.
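As a sanity check on formulas (2)-(4), the three kernels can be computed directly. The sketch below uses plain Python lists, with gamma, r, and d as the kernel hyperparameters; it illustrates the standard kernel definitions rather than reproducing any code from the study.

```python
import math

def dot(x, z):
    return sum(a * b for a, b in zip(x, z))

def rbf_kernel(x, z, gamma):
    # K(x, z) = exp(-gamma * ||x - z||^2), formula (2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def poly_kernel(x, z, gamma, r, d):
    # K(x, z) = (gamma * <x, z> + r)^d, formula (3)
    return (gamma * dot(x, z) + r) ** d

def sigmoid_kernel(x, z, gamma, r):
    # K(x, z) = tanh(gamma * <x, z> + r), formula (4)
    return math.tanh(gamma * dot(x, z) + r)
```

Note how γ appears in all three kernels; this is the second SVM parameter (besides C) that MAFSA optimizes later in the paper.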

2.2. Artificial Fish Swarm Algorithm (AFSA).
AFSA is a metaheuristic algorithm combining the concept of random search with empirical rules. AFSA solves optimization problems by simulating the movement of schools of fish and the intelligence underlying these behaviors. There are three search behaviors in AFSA: follow, swarm, and prey. AFSA repeatedly executes these three functions for every individual fish in order to find an optimal solution. Every fish represents a solution, and a fitness value represents the quality of each solution. Solving the problem of feature selection requires conversion into a form that AFSA can operate on. Table 1 presents the form of the feature set for each fish. [Table 2, referenced throughout this section, lists the parameter names and formulas of AFSA: the distance d(X_i, X_j), the crowd degree of X_i, the neighbors of X_i, and the total number of fish.]
In Table 1, C and γ represent the SVM parameters (the penalty parameter and the RBF kernel parameter, respectively), and F_1 to F_n represent the features. A feature value of 0 means that the feature is not selected, while a value of 1 means that it is selected.
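The Table 1 representation can be sketched as a small data structure. The class and function names below are illustrative assumptions, not code from the study; the value ranges for C and γ are likewise hypothetical defaults.

```python
import random
from dataclasses import dataclass

@dataclass
class Fish:
    """One candidate solution: SVM parameters plus a binary feature mask."""
    C: float              # SVM penalty parameter
    gamma: float          # RBF kernel parameter
    features: list        # features[k] == 1 keeps feature k, 0 drops it
    fitness: float = 0.0  # classification accuracy of this subset

def random_fish(n_features, c_range=(0.1, 100.0), g_range=(1e-4, 1.0)):
    # Step (1) of AFSA: randomly initialize parameters and feature subset.
    return Fish(
        C=random.uniform(*c_range),
        gamma=random.uniform(*g_range),
        features=[random.randint(0, 1) for _ in range(n_features)],
    )
```

Encoding the solution as (C, γ, binary mask) is what lets the same swarm search feature subsets and SVM parameters jointly, as in Experiment 2.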
The steps involved in feature selection for AFSA are presented as follows.
(1) Initialization. Randomly initialize the feature subset of each fish.

(2) Evaluate Fitness. Use the fitness function to evaluate the fitness value of the feature subset of every fish.

(3) Optimization Steps of the Fish Swarm. Search for an optimal solution by executing the three search steps (follow, swarm, and prey) for each fish.

(4) Output the Optimal Feature Subset. If the terminal conditions of the algorithm are satisfied, then stop the algorithm and output the optimal feature subset. Otherwise, proceed through another iteration starting at Step (2).
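The four steps above can be sketched as a single outer loop. The behavior functions are passed in as hooks with hypothetical names, each returning True when its replacement succeeds (matching the follow/swarm/prey fallback order described next); a fixed iteration count stands in for the paper's terminal condition.

```python
# Skeleton of the AFSA outer loop from Steps (1)-(4). The helper names
# (evaluate, follow, swarm_step, prey) are assumptions standing in for
# the behaviors defined in the text.
def afsa(init_swarm, evaluate, follow, swarm_step, prey, max_iters):
    school = list(init_swarm)
    for fish in school:
        evaluate(fish)                 # Step (2): fitness of each subset
    for _ in range(max_iters):         # Step (4): simplified terminal condition
        for fish in school:            # Step (3): three search behaviors
            if follow(fish, school):   # try follow first...
                continue
            if swarm_step(fish, school):  # ...then swarm...
                continue
            prey(fish)                 # ...and finally prey
    return max(school, key=lambda f: f.fitness)
```

The fallback chain (follow, then swarm, then prey) mirrors items (i)-(iii) below.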
The three search steps of AFSA (follow, swarm, and prey) are presented in the following, and the parameters are listed in Table 2.
(i) Follow. Executing follow for fish X_i involves comparing its fitness value with that of the best fish in its vicinity (among its neighbors). If the best fitness value among the neighboring fish is better than that of X_i and the crowd degree of that neighbor does not exceed the maximum, then the feature subset and parameters of X_i are replaced with those of the best neighbor. If the replacement succeeds, the algorithm executes follow for the next fish; otherwise, swarm is executed for X_i.

(ii) Swarm. If the follow step fails for X_i, the algorithm proceeds to swarm. At this point, the fitness value of X_i is compared with that of the center of the neighboring fish, as defined in Table 2. If the fitness value of the center is better than that of X_i and the crowd degree of the center does not exceed the maximum crowd degree, then the feature subset and parameters of X_i are replaced with those of the center. If the replacement succeeds, the algorithm executes follow for the next fish; otherwise, prey is executed for X_i.

(iii) Prey. If the swarm step fails for X_i, the algorithm proceeds to prey. In this step, the algorithm makes random changes to the features of X_i in order to create a new random fish; the number of changes never exceeds the vision of X_i. If the fitness value of the random fish exceeds that of X_i, then the feature subset and parameters of X_i are replaced with those of the random fish. If the replacement succeeds, the algorithm executes follow for the next fish; otherwise, it continues generating random fish until reaching the stipulated maximum number of attempts.
The parameters of AFSA are defined as follows.
(i) Distance. The distance between X_i and X_j is calculated according to the formula in Table 2. For example, if both fish have the same dimension d and the first feature of both fish is 0, then that feature contributes nothing to the distance between X_i and X_j. In contrast, if the first feature of X_i is 0 and the first feature of X_j is 1, then the distance between the two fish increases by 1. The total distance between the two fish is the sum of the differences over all d features.

(ii) Vision. The vision of every fish is calculated using the formula in Table 2. Vision also determines the maximum number of random feature changes implemented during prey.

(iii) Neighbor. The neighbors of X_i are determined using the formula in Table 2. Any fish X_j whose distance from X_i exceeds 0 but does not exceed the vision of X_i is deemed a neighbor of X_i.

(iv) Center. The center of X_i is calculated using the formula in Table 2 and can be viewed as an individual fish. If the first feature is 0 in more than half of the fish in the neighborhood of X_i, then the first feature of the center is set to 0; otherwise, it is set to 1. The remaining features are determined in the same way.

(v) Crowd Degree. The crowd degree of X_i is calculated using the formula in Table 2. This parameter represents the density of fish in the vicinity of X_i.

(vi) Maximum Crowd Degree. In the execution of the follow and swarm steps, we sought to prevent the agglomeration of all fish at the same point by designating that, if the crowd degree of X_i exceeds the maximum, no other fish is permitted to approach this location. In other words, no other fish is able to replace its feature subset with that of X_i.

(vii) Maximum Number of Attempts. This is the maximum number of times that prey can be executed.
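Assuming the binary encoding of Table 1, the parameter definitions in items (i)-(v) can be reconstructed as simple operations on 0/1 vectors. This is a sketch consistent with the descriptions above, not the paper's exact Table 2 formulas.

```python
def distance(a, b):
    # Hamming distance between two binary feature vectors (item (i)):
    # each differing feature adds 1 to the distance.
    return sum(x != y for x, y in zip(a, b))

def neighbors(i, school, vision):
    # Item (iii): fish j is a neighbor of fish i if 0 < distance <= vision.
    return [f for f in school if 0 < distance(school[i], f) <= vision]

def center(nbrs, n_features):
    # Item (iv): majority vote per feature over the neighborhood.
    return [1 if sum(f[k] for f in nbrs) * 2 > len(nbrs) else 0
            for k in range(n_features)]

def crowd_degree(nbrs, school_size):
    # Item (v): density of fish around X_i, here the ratio of neighbors
    # to the whole school (an assumption about the lost Table 2 formula).
    return len(nbrs) / school_size
```

Because the encoding is binary, distance, center, and crowd degree all reduce to counting operations, which keeps each AFSA iteration cheap relative to the SVM fitness evaluation.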

Modified Artificial Fish Swarm Algorithm
Researchers have revealed a number of shortcomings in AFSA, such as the likelihood of falling into a local optimum and a lack of diversity. This study developed a modified artificial fish swarm algorithm (MAFSA) with two fundamental changes: dynamic vision and an improved ability to search around the best fish swarm.

Dynamic Vision.
The vision parameter plays a crucial role in the performance of AFSA: it determines the number of neighboring fish with which the target fish interacts, which largely determines the success of the follow and swarm steps. Setting the vision parameter higher increases the likelihood of finding fish with higher fitness values; however, it can lead to the centralization of the fish swarm at a particular location. This also tends to limit diversity, which makes it easy to fall into a local optimum.
Setting the vision parameter lower reduces the number of neighboring fish (and thus the likelihood of finding a fish with a higher fitness value) and causes the school to scatter, increasing diversity. This enlarges the search space as well as the likelihood of finding an optimal solution; however, the time required to reach convergence is extended. Finding a reasonable balance in the assignment of the vision parameter can be troublesome. To overcome this difficulty, this study developed a mechanism referred to as dynamic vision, in which each fish is assigned a vision parameter suited to its condition. For example, an individual fish with a lower fitness value requires greater vision in order to find a solution quickly; conversely, a fish with a higher fitness value requires a smaller vision parameter in order to enhance local search. We employed the endocrine-based formula outlined in [26] to provide dynamic vision: each fish obtains its own vision parameter value according to its fitness value, so that if the fitness value is above the average, the vision parameter is decreased, and vice versa. In the endocrine-based formula (5), EM(i) represents the endocrine value of fish X_i, f_max represents the maximum fitness value of the fish in the school, f_avg represents the average fitness value in the school, and f_{i-1} and f_{i+1} represent the fitness values of fish X_{i-1} and X_{i+1}. In order to bound the range of the endocrine adjustment, g_1(x) = atan(x) and g_2(x) = atan(-x). In formula (6), vision(i) represents the original vision parameter value, calculated as the average distance between the fish, and CV is the adjustment constant. Figure 3 presents an example of dynamic vision. With the original vision parameter, fish A and fish B would have the same vision, regardless of their fitness values. Following adjustment using the endocrine-based formula, the fitness value of fish A, improved to 80, is above average, which decreases its vision in order to enhance its ability to perform local searches. In contrast, the fitness value of fish B dropped to 65, which enlarges the vision of fish B, thereby enhancing global search capacity in order to obtain solutions more quickly.
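The exact endocrine formulas (5)-(6) come from [26] and are not fully recoverable from the extracted text, so the sketch below is only an assumption that mimics the described behavior: a fish fitter than the swarm average gets a smaller vision (local search), a weaker fish a larger one (global search), with atan bounding the adjustment and CV scaling it.

```python
import math

def dynamic_vision(base_vision, fitness, f_avg, f_max, cv=1.0):
    # Hypothetical stand-in for formulas (5)-(6): shrink vision for
    # above-average fish, enlarge it for below-average fish.
    if f_max == f_avg:
        return base_vision
    rel = (fitness - f_avg) / (f_max - f_avg)  # > 0 when above average
    adjust = math.atan(rel) * cv               # atan bounds the adjustment
    return max(1.0, base_vision - adjust)

# The Figure 3 example: same base vision, different fitness values.
v_strong = dynamic_vision(5.0, 80, 70, 80)  # fish A, above average
v_weak = dynamic_vision(5.0, 65, 70, 80)    # fish B, below average
```

With these numbers, fish A's vision shrinks below the base value while fish B's grows, reproducing the qualitative effect shown in Figure 3.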

Searching for the Best Fish Swarm.
This study employed a simple method of searching around the best fish swarm to enhance local search ability and prevent falling into a local optimum. After each iteration finishes, the algorithm copies the fish with the best fitness values into a best fish swarm and then executes a simple local search to enhance the possibility of finding an optimal solution. Four parameters were added to facilitate this change: best fish number (BFN), best fish search number (BSN), best fish mutation rate (BMR), and minimum change number (MCN). BFN represents the number of fish copied; for example, setting BFN = 5 causes the five best fish to be copied to the best fish swarm. BSN represents the number of local searches for each fish in the best fish swarm; setting BSN = 10 causes the algorithm to execute ten local searches for each fish in the best fish swarm. BMR represents the mutation rate of the local searches, and MCN represents the minimum number of feature changes in each local search. For example, suppose the number of features is 8, BMR is 0.1, and MCN is 1; then 8 * 0.1 = 0.8. Because 0.8 is less than MCN (1), one random change is executed in every local search in the best fish swarm. Figure 5 presents an example of a local search around the best fish swarm. After five local searches among this collection of high-scoring fish, the fish previously identified as best fish 1 is replaced by the fish with the best feature subset. Figure 4 presents a flowchart of MAFSA.
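The best-fish-swarm local search can be sketched as follows; the function and parameter names follow the text (BFN, BSN, BMR, MCN), but the structure is an illustrative assumption, and evaluate is a hypothetical hook returning the fitness of a candidate feature mask.

```python
import random

def local_search(school, evaluate, BFN=5, BSN=10, BMR=0.1, MCN=1):
    # Copy the BFN best fish; each gets BSN random local searches.
    # The number of bit flips per search is n_features * BMR, floored
    # at MCN (so 8 * 0.1 = 0.8 becomes 1 change, as in the text).
    best = sorted(school, key=lambda f: f.fitness, reverse=True)[:BFN]
    for fish in best:
        n = len(fish.features)
        n_changes = max(int(n * BMR), MCN)
        for _ in range(BSN):
            candidate = list(fish.features)
            for k in random.sample(range(n), n_changes):
                candidate[k] ^= 1            # flip selected feature bits
            cand_fitness = evaluate(candidate)
            if cand_fitness > fish.fitness:  # keep improvements only
                fish.features, fish.fitness = candidate, cand_fitness
    return best
```

Because only strict improvements are kept, the search can sharpen the best solutions without ever degrading them, which is how this step counters AFSA's tendency to stall in a local optimum.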

Experiment Environment.
To compare the performance of MAFSA and AFSA, we used datasets commonly employed in machine learning, drawn from the UCI repository [27]. Ten datasets with different numbers of records, features, and classes, from different fields, are presented in Table 3. The parameters of MAFSA and AFSA are presented in Table 4; AFSA does not have the BFN, BSN, BMR, or MCN parameters, which are used only for assembling the best fish swarm in MAFSA. C and γ are the parameters of SVM, with C representing the penalty parameter and γ representing the parameter of the RBF kernel function. This study used libsvm [28] as the classifier.
In [26], it was shown that setting the right parameters for SVM can greatly enhance the effectiveness of classification. As mentioned in Section 2.2, the proposed algorithm determines the optimal values for these two parameters. To ensure the reliability of the experiments, we used tenfold cross-validation. Figure 6 presents a flowchart of the proposed model for feature selection using MAFSA and SVM. For each fold of the cross-validation, the dataset was divided into two parts, training data and testing data. Various feature subsets were created throughout the process of MAFSA. After deleting the features (and corresponding data) that were not selected, the training data was input into SVM to build a classification model. SVM then output a value representing the classification accuracy of each feature subset, in the form of a fitness value. MAFSA stopped and output the optimal feature subset only after the terminal conditions were satisfied. In this case, the terminal condition of each fold was a lack of change in the optimal feature subset for a period of 1 hour.
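The tenfold protocol above can be sketched as follows. The classifier call is deliberately abstracted away (the study used libsvm), so train_and_score is a hypothetical hook that trains on one index set and returns accuracy on the other.

```python
def ten_fold_indices(n_samples, k=10):
    # Partition sample indices into k roughly equal folds; each fold
    # serves once as the test set while the rest form the training set.
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

def cross_val_fitness(n_samples, train_and_score, k=10):
    # Fitness of a feature subset = mean accuracy over the k folds.
    scores = [train_and_score(tr, te)
              for tr, te in ten_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)
```

Averaging over folds is what makes the fitness value a reliable estimate, at the cost of training the SVM k times per candidate subset.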

Experiment 1: Feature Selection without Parameter Optimization.
To compare the effectiveness of AFSA and MAFSA, we examined three experimental results: (1) the dataset without feature selection, (2) the dataset after feature selection using AFSA and SVM, and (3) the dataset after feature selection using MAFSA and SVM. Experiment 1 did not involve a search for the optimal SVM parameters C and γ; in other words, Experiment 1 used the default values of C and γ found in libsvm. The results are presented in Table 5.
In Table 5, the first group is the dataset without feature selection, which did not provide satisfactory results with regard to classification accuracy; because this group used the data without feature selection, the number of selected features is the same as in the original dataset. The second group shows the results after feature selection using AFSA, and the third group shows the results after feature selection using MAFSA. The classification accuracies of Groups 2 and 3 are far better than those of Group 1, and the numbers of selected features are far smaller. MAFSA provided the best classification accuracy in all ten of the datasets; however, in five of the datasets, the results were on par with those obtained using AFSA. MAFSA resulted in fewer selected features in eight of the ten datasets.
In the five datasets in which MAFSA matched rather than exceeded the classification accuracy of AFSA, MAFSA still produced fewer selected features. Experiment 1 did not involve the SVM parameters C and γ; therefore, AFSA and MAFSA were able to search nearly all of the possible feature subsets. In some folds of these datasets, more than one feature subset provided the same best classification accuracy. For this reason, MAFSA resulted in fewer selected features while maintaining the same classification accuracy, thereby demonstrating the superior search ability of MAFSA.

Experiment 2: Feature Selection with Parameter Optimization.
Experiment 2 compares the performance of AFSA and MAFSA when the SVM parameters C and γ are also considered. In other words, AFSA and MAFSA searched not only for the optimal feature subset but also for the optimal values of C and γ. The results are presented in Table 6.
In Table 6, MAFSA is shown to have higher classification accuracy in eight of the ten datasets, as well as fewer selected features in six of the ten datasets. After optimizing the SVM parameters C and γ, the classification accuracy improved significantly in all ten datasets. MAFSA showed higher classification accuracy in most of the datasets; however, the differences between AFSA and MAFSA were small. We therefore used statistical analysis to verify the improvement obtained using MAFSA. Data analysis was performed using SPSS. We employed the Friedman test, a nonparametric statistical method used to determine whether k related samples differ significantly, to compare the performance of the algorithms. We used the classification accuracies of AFSA and MAFSA on the ten datasets in Table 6 to verify the superior performance of MAFSA. The results indicated a significance value of 0.02, which is less than 0.05; that is, the test provides 95% confidence that MAFSA improves classification accuracy over AFSA.
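The Friedman statistic used above (the paper computed it in SPSS) can be reproduced from its textbook definition: rank the k algorithms within each of the n datasets (rank 1 = best, ties get average ranks), then compute chi-square_F from the rank sums. A sketch:

```python
def friedman_statistic(scores):
    # scores[i][j] = accuracy of algorithm j on dataset i.
    # chi-square_F = 12/(n*k*(k+1)) * sum(R_j^2) - 3*n*(k+1),
    # where R_j is the sum of algorithm j's ranks over the n datasets.
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: row[j], reverse=True)
        ranks = [0.0] * k
        i = 0
        while i < k:
            # Group tied values (adjacent after sorting) and average ranks.
            tied = [j for j in order if row[j] == row[order[i]]]
            avg = sum(range(i + 1, i + len(tied) + 1)) / len(tied)
            for j in tied:
                ranks[j] = avg
            i += len(tied)
        for j in range(k):
            rank_sums[j] += ranks[j]
    return (12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums)
            - 3.0 * n * (k + 1))
```

With k = 2 algorithms and n = 10 datasets, the statistic is maximal (10) when one algorithm wins on every dataset and 0 when the two always tie, so the paper's significant result indicates a consistent win pattern for MAFSA.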

Conclusion and Future Work
This research proposed a modified version of the artificial fish swarm algorithm, used in conjunction with a support vector machine, for feature selection. MAFSA differs from AFSA in its use of an endocrine-based formula to provide dynamic vision, assigning an appropriate search space to each fish. We also included a mechanism by which a swarm of the best fish is assembled to enhance local search ability. Experiments demonstrate that MAFSA is superior to AFSA with regard to classification accuracy as well as the number of selected features.
Nonetheless, MAFSA still has room for improvement. For example, all of the parameters of MAFSA could be selected dynamically to further enhance adaptability to different datasets. We designed a binary-coded algorithm to deal with the encoding of feature selection; we expect to extend MAFSA to continuous optimization problems and to apply it to real-world optimization problems in a wide range of applications.

Figure 1 :
Figure 1: Connection between confidence interval and empirical risk.

Figure 3 :
Figure 3: Original vision and dynamic vision.

Figure 5 :
Figure 5: Process of assembling swarm of best fish.

Figure 6 :
Figure 6: Flowchart of feature selection of MAFSA and SVM.

Table 1 :
Representation of feature sets.

Table 3:
Datasets used in the experiments.

Table 4:
Parameters of MAFSA and AFSA.

Table 5 :
Results of Experiment 1. Markers indicate the fewest number of selected features and the highest classification accuracy.

Table 6 :
Results of Experiment 2. Markers indicate the fewest number of selected features and the highest classification accuracy.