An Improved Grey Wolf Optimization Strategy Enhanced SVM and Its Application in Predicting the Second Major

1 Wenzhou Vocational College of Science and Technology, Wenzhou, Zhejiang 325006, China
2 Beijing Entry-Exit Inspection and Quarantine Bureau, Beijing 100026, China
3 College of Computer Science and Technology, Jilin University, Changchun 130012, China
4 Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
5 College of Physics and Electronic Information Engineering, Wenzhou University, Wenzhou 325035, China


Introduction
At present, most colleges and universities attach great importance to students' diversified development and offer senior students the opportunity to choose a professional direction. Choosing a major that suits one's own characteristics is therefore of great importance and is a prerequisite for future career development. However, students are apt to get lost when choosing a major because of factors such as subjective consciousness, randomness, and blindness. Is there a scientific and reasonable means of allowing students to understand their own characteristics and to find the most suitable major for themselves? Colleges have accumulated a large amount of student-related data during the training process, and much important information and knowledge is hidden in these data. Data mining technology can extract this information to construct a prediction model appropriate for major classification, which can help students make the right decision when choosing a major.
So far, data mining techniques have been widely applied to instruction and evaluation, behavior analysis of teachers and students, score analysis and prediction, occupational guidance, and other areas, and several methods have been proposed for course selection, major selection, and so on. Cheng et al. [1] proposed an innovative approach that combines a student concept model and a change mining mechanism for analyzing the learning problems of students from their historical assessment data. The experimental results showed that the analysis results provided by this approach helped teachers to provide appropriate instructional assistance and remedial learning materials for improving the learning achievements of the students. Elbadrawy et al. [2] proposed to predict students' performance with a recommendation system based on personalized analysis. The results showed that multiple regression and improved matrix factorization predicted students' scores in the next term more accurately and in a more timely manner than traditional methods. Ognjanovic et al. [3] proposed an approach for extracting student preferences from sources available in institutional student information systems. The extracted preferences were analyzed using the analytical hierarchy process, which was used for predicting students' course selection. The results demonstrated that the accuracy was high and equivalent to that of previous data mining approaches using fully identifiable data. Campagni et al. [4] presented a data mining methodology based on clustering and sequential pattern techniques to analyze the careers of university graduates. The results underlined that the more closely students follow the order given by the ideal career, the better they perform in terms of graduation time and final grade. Kardan et al. [5] proposed to use neural networks to establish models for analyzing and predicting college students' network selection. Experimental results showed that the model had higher prediction accuracy than Support Vector Regression, k-Nearest Neighbor, and Decision Tree. Thammasiri et al. [6] discussed the problem of imbalanced class distribution and combined support vector machine models with oversampling-based data balancing to predict the attrition of new students. The results showed that this kind of data mining technology achieved the best classification, with an overall prediction accuracy of over 90%. Lee [7] performed logistic regression analysis and found that taking a large number of computer courses at the middle school level had a significant impact on STEM subject selection in US colleges. Huang and Xu [8] analyzed the evaluation theories of psychology and statistics and developed a second major selection system based on a rough set-based association rule mining algorithm. The simulation results demonstrated that their method was accurate for software engineering, networking, and programming majors.
To improve the performance of second major selection, this study proposes an improved support vector machine (SVM) based prediction system. In the proposed approach, an enhanced grey wolf optimization strategy (hereafter IGWO) was established to screen the representative features, and then SVM was employed to perform the prediction task based on the feature subset identified by the IGWO strategy. GWO is a swarm intelligence method proposed recently by Mirjalili et al. [9]. Owing to its great exploration capacity, it has been successfully applied to many practical problems, such as load frequency control of interconnected power systems [10], combined heat and power dispatch [11], design of castellated beams [12], and the two-stage assembly flow shop scheduling problem [13]. However, the initial population of the original GWO is generated randomly, so the grey wolves in the search space may lack diversity. Many studies [14-18] have shown that population initialization may affect the global convergence speed and also the quality of the final solution in swarm intelligence optimization. Moreover, an initial population with good diversity helps to improve the performance of the optimization algorithm. Inspired by this idea, we use particle swarm optimization (PSO) to generate a diverse initial population and then construct a binary version of GWO to execute the feature selection task. One main reason for choosing PSO is that it is a simple approach with fast search speed and high efficiency compared with others. The experimental results show that PSO indeed improves the quality of the initial population for GWO. At the same time, it is also important to choose an effective and efficient classifier for evaluating the most discriminative features. In this study, the SVM classifier is used to compute the fitness value owing to its good generalization capability and excellent performance in many classification tasks [18-23].
The efficacy of the resultant method, the IGWO-SVM based prediction system, was rigorously compared against SVM without feature selection, GWO based SVM (hereafter GWO-SVM), genetic algorithm based SVM (hereafter GA-SVM), and particle swarm optimization-based SVM (hereafter PSO-SVM) on the real-life dataset collected from Wenzhou Vocational College of Science and Technology. The classifiers were compared with respect to the classification accuracy (ACC), sensitivity, specificity, and the area under the receiver operating characteristic curve (AUC) criterion. The experimental results demonstrated that the proposed IGWO-SVM approach achieved much better performance than other competitive counterparts.
The rest of this paper is organized as follows. Section 2 offers brief background knowledge on SVM and the binary versions of GWO and PSO. Section 3 presents the detailed implementation of the proposed method. Section 4 describes the experimental design. Section 5 presents the experimental results and discusses the proposed approach. Finally, Section 6 summarizes the conclusions and recommendations for future work.

Support Vector Machine (SVM).
SVM is a classification algorithm that improves generalization ability by minimizing the structural risk of the learning machine. Its core idea is the maximum margin strategy, which can ultimately be transformed into a convex quadratic programming problem. Thanks to this property, SVM has found applications in a wide range of fields [19, 20, 23-32].
In a binary classification task, the samples are separated by a hyperplane $w^{T}x + b = 0$, where $w$ is a $d$-dimensional coefficient vector that is normal to the hyperplane, $b$ is the offset from the origin, and $x$ are data points. The main task of SVM is to determine $w$ and $b$. In the linear case, $w$ can be solved by introducing Lagrangian multipliers $\alpha_i$. The data points on the maximum margin are called support vectors (SVs). As a result, the solution for $w$ takes the following form: $w = \sum_{i=1}^{n} \alpha_i y_i x_i$, where $n$ is the number of SVs and $y_i$ are the labels corresponding to the samples $x_i$. Then $b$ can be derived from $y_i (w^{T} x_i + b) - 1 = 0$, where $x_i$ are SVs. After $w$ and $b$ are determined, the linear discriminant function is given by
$$g(x) = \operatorname{sgn}\left(\sum_{i=1}^{n} \alpha_i y_i x_i^{T} x + b\right).$$
In nonlinear cases, the kernel trick is introduced, and the decision function can be expressed as follows:
$$g(x) = \operatorname{sgn}\left(\sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b\right).$$
Generally, any positive semidefinite function that satisfies Mercer's condition can be used as a kernel function [33], such as the polynomial kernel $K(x, x_i) = ((x^{T} x_i) + 1)^{d}$ and the Gaussian kernel $K(x, x_i) = \exp(-\gamma \|x - x_i\|^{2})$. This section gives only a brief description of SVM. For more details, one can refer to [34, 35], which provide a complete description of SVM theory.
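As a minimal sketch of the kernel decision function described above, the two kernels and the sign rule can be written in plain Python. The function names, the toy support vectors, and the multipliers below are hypothetical illustration values, not fitted quantities from this study; the default $\gamma = 0.125$ simply mirrors the value reported later in the experimental setup.

```python
import math


def polynomial_kernel(x, xi, d=2):
    """Polynomial kernel K(x, x_i) = (x^T x_i + 1)^d."""
    return (sum(a * b for a, b in zip(x, xi)) + 1.0) ** d


def gaussian_kernel(x, xi, gamma=0.125):
    """Gaussian kernel K(x, x_i) = exp(-gamma * ||x - x_i||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-gamma * sq_dist)


def decision(x, support_vectors, labels, alphas, b, kernel):
    """Return sgn(sum_i alpha_i * y_i * K(x_i, x) + b) as +1 or -1."""
    score = b
    for sv, y, alpha in zip(support_vectors, labels, alphas):
        score += alpha * y * kernel(sv, x)
    return 1 if score >= 0 else -1


# Hand-picked toy values (assumed for illustration, not learned from data).
svs = [(1.0, 1.0), (-1.0, -1.0)]
ys = [1, -1]
alphas = [0.5, 0.5]
print(decision((0.8, 0.9), svs, ys, alphas, 0.0, gaussian_kernel))
```

In practice the multipliers and offset come from solving the quadratic program; the sketch only shows how a trained model is evaluated on a new point.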

Binary Grey Wolf Optimization.
Grey wolf optimization (GWO) is a metaheuristic algorithm proposed by Mirjalili et al. [9] in 2014. It mimics the social leadership and hunting behavior of grey wolves in nature. In every iteration of GWO, the three fittest candidate solutions are designated alpha, beta, and delta, and they lead the population toward promising regions of the search space. The rest of the grey wolves are named omega and are required to assist alpha, beta, and delta in encircling, hunting, and attacking the prey, that is, in finding better solutions.
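In order to mathematically simulate the encircling behavior of grey wolves, the following equations are proposed:
$$\vec{D} = \left|\vec{C} \cdot \vec{X}_{p}(t) - \vec{X}(t)\right|, \qquad \vec{X}(t+1) = \vec{X}_{p}(t) - \vec{A} \cdot \vec{D},$$
$$\vec{A} = 2\vec{a} \cdot \vec{r}_{1} - \vec{a}, \qquad \vec{C} = 2 \cdot \vec{r}_{2},$$
where $t$ is the current iteration, $\vec{X}_{p}$ is the position vector of the prey, $\vec{X}$ is the position vector of a grey wolf, $\vec{a}$ is linearly decreased from 2 to 0, and $\vec{r}_{1}$ and $\vec{r}_{2}$ are random vectors in $[0, 1]$.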
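In order to mathematically simulate the hunting behavior of grey wolves, the following equations are proposed:
$$\vec{D}_{\alpha} = \left|\vec{C}_{1} \cdot \vec{X}_{\alpha} - \vec{X}\right|, \qquad \vec{D}_{\beta} = \left|\vec{C}_{2} \cdot \vec{X}_{\beta} - \vec{X}\right|, \qquad \vec{D}_{\delta} = \left|\vec{C}_{3} \cdot \vec{X}_{\delta} - \vec{X}\right|,$$
$$\vec{X}_{1} = \vec{X}_{\alpha} - \vec{A}_{1} \cdot \vec{D}_{\alpha}, \qquad \vec{X}_{2} = \vec{X}_{\beta} - \vec{A}_{2} \cdot \vec{D}_{\beta}, \qquad \vec{X}_{3} = \vec{X}_{\delta} - \vec{A}_{3} \cdot \vec{D}_{\delta},$$
$$\vec{X}(t+1) = \frac{\vec{X}_{1} + \vec{X}_{2} + \vec{X}_{3}}{3}. \tag{6}$$
In this work, a new binary GWO (IGWO) is proposed for the feature selection task. Figure 1 presents a flowchart of the proposed IGWO. In the IGWO, each grey wolf has a flag vector whose length is equal to the total number of features in the dataset. After the position of a grey wolf is updated by (6), the position is discretized into this binary flag vector, where $X_{i,j}$ indicates the $j$th position of the $i$th grey wolf.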
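The hunting update in (6) and the subsequent discretization can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, the random coefficients are drawn independently per dimension, and the threshold of 0.5 used to turn a continuous position into a 0/1 flag is an assumption made for the example.

```python
import random


def linear_a(iteration, max_iter):
    """Control parameter a, decreased linearly from 2 to 0."""
    return 2.0 - 2.0 * iteration / max_iter


def gwo_update(position, alpha, beta, delta, a, rng=random):
    """One continuous position update of a single wolf guided by the leaders."""
    new_position = []
    for j, x in enumerate(position):
        candidates = []
        for leader in (alpha, beta, delta):
            r1, r2 = rng.random(), rng.random()
            A = 2.0 * a * r1 - a          # A = 2a * r1 - a
            C = 2.0 * r2                  # C = 2 * r2
            D = abs(C * leader[j] - x)    # distance to this leader
            candidates.append(leader[j] - A * D)
        # Average of the three leader-guided candidates, as in (6).
        new_position.append(sum(candidates) / 3.0)
    return new_position


def binarize(position, threshold=0.5):
    """Assumed discretization: flag is 1 when the component exceeds threshold."""
    return [1 if x > threshold else 0 for x in position]


wolf = gwo_update([0.2, 0.7, 0.4], [1, 0, 1], [1, 1, 0], [0, 0, 1],
                  a=linear_a(10, 100))
print(binarize(wolf))
```

A flag of 1 at index $j$ would mean that the $j$th feature is passed to the classifier.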

Binary Particle Swarm Optimization (PSO).
Particle swarm optimization (PSO) was first developed by Kennedy and Eberhart [36]. The basic idea of the PSO algorithm is to find the optimal solution through collaboration and information sharing among individuals in a group. PSO is simple, easy to implement, and has few parameters to adjust. It has been widely used in function optimization, combinatorial optimization, neural network training, fuzzy system control, and other applications [37]. In PSO, each individual is treated as a particle in $d$-dimensional space, and each particle has a position and a velocity. The position vector of the $i$th particle is represented as $X_{i} = (x_{i,1}, x_{i,2}, \ldots, x_{i,d})$, and its corresponding velocity is represented as $V_{i} = (v_{i,1}, v_{i,2}, \ldots, v_{i,d})$. The velocity and position are updated as follows:
$$v_{i,j}^{t+1} = v_{i,j}^{t} + c_{1} r_{1} \left(p_{i,j} - x_{i,j}^{t}\right) + c_{2} r_{2} \left(p_{g,j} - x_{i,j}^{t}\right), \tag{8}$$
$$x_{i,j}^{t+1} = x_{i,j}^{t} + v_{i,j}^{t+1}, \tag{9}$$
where $p_{i,j}$ and $p_{g,j}$ are the $j$th components of the best positions found so far by the $i$th particle and by the whole swarm, respectively, and $c_{1}$ and $c_{2}$ are acceleration coefficients used to better balance the search between global exploration and local exploitation. In addition, $r_{1}$ and $r_{2}$ in (8) are random numbers generated uniformly in the range $[0, 1]$. For the purpose of feature selection, the binary PSO introduced by Kennedy and Eberhart [38] was employed. In this version of PSO, a sigmoid function is applied to transform the velocity from continuous space to probability space:
$$\operatorname{sig}(v_{i,j}) = \frac{1}{1 + \exp(-v_{i,j})}. \tag{10}$$
The velocity updating scheme in (8) remains unchanged except that $x_{i,j}$, $p_{i,j}$, and $p_{g,j} \in \{0, 1\}$; to ensure that a bit can switch between 1 and 0 with a positive probability, $v_{\max}$ is introduced to limit $v_{i,j}$. The new particle position is updated using the following rule:
$$x_{i,j} = \begin{cases} 1, & \text{if } \mathrm{rnd} < \operatorname{sig}(v_{i,j}), \\ 0, & \text{otherwise}, \end{cases}$$
where $\operatorname{sig}(v_{i,j})$ is calculated according to (10) and $\mathrm{rnd}$ is a random number drawn uniformly from $[0, 1]$.
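One step of the binary PSO described above might look like the sketch below. It assumes the velocity update without an inertia weight, and the values `c1 = c2 = 2.0` and `v_max = 6.0` are placeholders chosen for illustration, not settings taken from this study; `bpso_step` and its argument names are likewise hypothetical.

```python
import math
import random


def sigmoid(v):
    """Map a velocity to a probability, as in (10)."""
    return 1.0 / (1.0 + math.exp(-v))


def bpso_step(x, v, pbest, gbest, c1=2.0, c2=2.0, v_max=6.0, rng=random):
    """Update one binary particle; returns (new_bits, new_velocity)."""
    new_x, new_v = [], []
    for j in range(len(x)):
        r1, r2 = rng.random(), rng.random()
        vj = v[j] + c1 * r1 * (pbest[j] - x[j]) + c2 * r2 * (gbest[j] - x[j])
        # Clamp to [-v_max, v_max] so sig(v) never saturates at exactly 0 or 1.
        vj = max(-v_max, min(v_max, vj))
        new_v.append(vj)
        # Bit is set to 1 with probability sig(vj).
        new_x.append(1 if rng.random() < sigmoid(vj) else 0)
    return new_x, new_v


bits, vel = bpso_step([1, 0, 1, 0], [0.0, 0.0, 0.0, 0.0],
                      pbest=[1, 1, 0, 0], gbest=[1, 1, 1, 0])
print(bits, vel)
```

Because every bit is resampled from its probability at each step, the velocity acts as a per-feature propensity rather than a displacement.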

Proposed Methodology
In this section, we briefly describe the proposed system for second major selection. The proposed IGWO-SVM methodology consists of two main steps. In the first step, PSO is used to generate the initial positions of the population, and then IGWO is used to select the best feature combination by adaptively searching the feature space. In the second step, the SVM classifier is applied to evaluate the classification accuracy based on the optimal feature subset. The flowchart of the proposed IGWO-SVM approach is presented in Figure 2. The best feature combination is the one with the maximum classification accuracy and the minimum number of selected features. The fitness function used in IGWO to evaluate the selected features is the average classification accuracy over the 10-fold cross validation scheme. The pseudocode of the feature selection procedure is presented in Algorithm 1.
whether they enrolled in self-study undergraduate courses, and the scores of the basic course relevant to their majors. Table 1 shows a detailed description of these 11 factors.
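The wrapper-style fitness evaluation used in the first step can be sketched as follows. To keep the example self-contained, a leave-one-out 1-nearest-neighbour accuracy stands in for the 10-fold cross-validated SVM accuracy of the paper; that substitution, the zero score for an empty feature subset, the function names, and the toy data are all assumptions made for illustration.

```python
def select_columns(X, mask):
    """Keep only the features whose flag in the binary mask is 1."""
    return [[v for v, keep in zip(row, mask) if keep] for row in X]


def loo_1nn_accuracy(X, y):
    """Stand-in evaluator: leave-one-out accuracy of a 1-NN classifier."""
    correct = 0
    for i, xi in enumerate(X):
        nearest = min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(xi, X[j])),
        )
        correct += int(y[nearest] == y[i])
    return correct / len(X)


def fitness(mask, X, y, evaluate=loo_1nn_accuracy):
    """Fitness of a wolf: accuracy obtained on the selected features only."""
    if not any(mask):
        return 0.0  # assumption: an empty subset gets the worst score
    return evaluate(select_columns(X, mask), y)


# Toy data: feature 0 separates the classes, feature 1 is misleading.
X = [[0.0, 5.0], [0.1, -3.0], [1.0, 4.0], [0.9, -4.0]]
y = [0, 0, 1, 1]
print(fitness([1, 0], X, y), fitness([0, 1], X, y))
```

In the toy data the mask that keeps only the informative feature scores higher, which is the signal the optimizer follows when ranking alpha, beta, and delta.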

Experimental Setup.
To verify the proposed approach, several methods including SVM without feature selection, PSO-SVM, GA-SVM, and GWO-SVM were employed for rigorous comparison. For SVM, the LIBSVM implementation developed by Chang and Lin [39] was utilized. The data were first scaled into the range $[-1, 1]$ before classification. The empirical experiment was conducted on an AMD Athlon 64 X2 Dual Core Processor 5000+ (2.6 GHz) with 4 GB of RAM, running Windows 7.
In order to gain an unbiased estimate of the generalization accuracy, $k$-fold cross validation (CV) was used to evaluate the classification accuracy [40]. This study set $k$ to 10; that is, the dataset is divided into ten subsets. Each time, one of the 10 subsets is used as the test set and the remaining 9 subsets are put together to form a training set. Then the average error across all 10 trials is computed. The advantage of this method is that all of the test sets are independent, which improves the reliability of the results. It should be pointed out that a single repetition of the 10-fold CV will not generate enough classification accuracies for comparison because of the arbitrary partition of the dataset. The 10-fold CV is therefore repeated and averaged over 10 runs for accurate evaluation. The two key parameters of SVM, the penalty factor $C$ and the kernel width $\gamma$, were both searched over $\{2^{-5}, 2^{-4}, \ldots, 2^{4}, 2^{5}\}$. By trial and error, SVM achieved the best performance when $C = 2^{5}$ (32) and $\gamma = 2^{-3}$ (0.125). Therefore, $C$ and $\gamma$ for SVM are set to 32 and 0.125 in the experiments. The detailed parameter settings are outlined in Table 2.
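The preprocessing and evaluation protocol above can be sketched in a few lines. The helper names, the shuffling seed, and the handling of a constant column are assumptions for illustration; the candidate grid reproduces the powers of two listed in the text.

```python
import random


def scale_to_unit_range(column):
    """Linearly rescale one feature column into [-1, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # assumption: constant column maps to 0
    return [-1.0 + 2.0 * (v - lo) / (hi - lo) for v in column]


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


# Candidate values 2^-5, 2^-4, ..., 2^5 for both C and gamma.
grid = [2.0 ** e for e in range(-5, 6)]

print(scale_to_unit_range([0, 5, 10]))
print(len(list(k_fold_indices(25))), grid[0], grid[-1])
```

Repeating the split with ten different seeds and averaging the fold accuracies would correspond to the 10 runs of 10-fold CV described above.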

Measure for Performance Evaluation.
To evaluate the performance of the second major selection by the IGWO-SVM approach, we mainly examined four metrics: classification accuracy (ACC), the area under the receiver operating characteristic (ROC) curve (AUC) [41], sensitivity (SE), and specificity (SP).
ACC is the proportion of the total number of predictions that were correct. It is determined as follows:
$$\mathrm{ACC} = \frac{TP + TN}{TP + FP + FN + TN} \times 100\%.$$
SE is the proportion of positives that were correctly classified, as calculated using the following equation:
$$\mathrm{SE} = \frac{TP}{TP + FN} \times 100\%.$$
SP is the proportion of negatives that were correctly classified, as calculated as follows:
$$\mathrm{SP} = \frac{TN}{FP + TN} \times 100\%,$$
where TP is the number of true positives, which means that major "A" is correctly classified as the corresponding one; FN is the number of false negatives, which means that major "A" is classified as major "B"; TN is the number of true negatives, which means that major "B" is correctly classified as the corresponding one; and FP is the number of false positives, which means that major "B" is classified as major "A." The ROC curve is a graphical display that measures the predictive accuracy of a classifier by plotting the true positive rate against the false positive rate. The AUC is the area under the ROC curve, which is one of the best methods for comparing classifiers in two-class problems.
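The four metrics can be computed directly from predictions, as in the sketch below. The function names and the toy labels and scores are hypothetical; the values are returned as fractions in $[0, 1]$ rather than percentages, and the AUC is computed with the pairwise ranking formulation, which equals the area under the ROC curve.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FN, TN, FP) for a two-class problem."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    return tp, fn, tn, fp


def acc_se_sp(tp, fn, tn, fp):
    """Accuracy, sensitivity, and specificity as fractions."""
    acc = (tp + tn) / (tp + fp + fn + tn)
    se = tp / (tp + fn)
    sp = tn / (fp + tn)
    return acc, se, sp


def auc(y_true, scores, positive=1):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]
print(acc_se_sp(*confusion_counts(y_true, y_pred)), auc(y_true, scores))
```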

Experimental Results and Discussion
In this experiment, we first evaluated the effectiveness of the SVM model on the original feature space. The values of $C$ and $\gamma$ were specified adaptively by the grid search strategy for each fold of the data. The related parameters $C$ and $\gamma$ of the RBF kernel were varied over $C \in \{2^{-5}, 2^{-3}, \ldots, 2^{5}\}$ and $\gamma \in \{2^{-5}, 2^{-3}, \ldots, 2^{5}\}$. The optimal parameter pair $(C, \gamma)$ was employed to construct the predictive model. Figure 3 depicts the contour diagram of the two control parameters obtained by the grid search strategy on the 1st fold data for the SVM classifier. The 3D view of the parameter searching procedure for the 1st fold data is shown in Figure 4.
From the above figures, we can see that the performance of SVM is sensitive to the two parameters.
In order to further improve the performance of the SVM model, we evaluated the proposed IGWO strategy enhanced SVM approach on the same data. Table 4 lists the detailed classification results obtained by the IGWO-SVM approach. IGWO-SVM achieved a promising result with 87.32% ACC, 0.8722 AUC, 86.58% sensitivity, and 87.86% specificity. Compared with the original SVM method, IGWO-SVM improved ACC, AUC, sensitivity, and specificity by 5.29%, 5.28%, 3.32%, and 7.23%, respectively. This indicates that certain redundant and irrelevant features exist in the data. In addition, the standard deviation obtained by IGWO-SVM is smaller than that of the original SVM, which indicates that IGWO-SVM offers more robust and stable prediction results.
To verify the performance of the constructed IGWO-SVM, it was compared with the other metaheuristic-based SVM methods, namely GA-SVM, PSO-SVM, and GWO-SVM, as well as with the original SVM. The mean ACC and standard deviation obtained by each method over the 10 runs of 10-fold CV are shown in Figure 5. It can be seen from this figure that IGWO-SVM surpasses the original SVM, GA-SVM, PSO-SVM, and GWO-SVM in terms of ACC, AUC, sensitivity, and specificity, which means that PSO-based initialization improves the performance of GWO to a certain extent. Figure 5 also shows that the standard deviation of IGWO-SVM is much smaller than that of the other four competitors on all four evaluation metrics. Hence, the SVM constructed with IGWO leads to a more stable classification result for selecting a second major than the others. In short, IGWO not only improves the performance of SVM relative to the other competitors but also yields a more stable classifier for selecting a second major. The comparison results indicate that IGWO-SVM is the most stable and robust method for second major selection, followed by the original SVM, GA-SVM, PSO-SVM, and GWO-SVM.
To explore the optimization procedures of the metaheuristic methods IGWO-SVM, GWO-SVM, PSO-SVM, and GA-SVM, we recorded the evolutionary process of the four methods, which is shown in Figure 6. The SVM model trained via IGWO exhibits an obvious advantage over the other SVM methods in terms of both convergence rate and solution quality, and it takes only 22 iterations to achieve the best fitness. In contrast, GWO-SVM, PSO-SVM, and GA-SVM need 62, 76, and 61 iterations, respectively, to reach their maximum fitness. In addition, the curve of IGWO-SVM ended at the 220th iteration because the fitness could not be improved for 200 consecutive iterations, which was preset as the termination condition. It can also be seen that GA-SVM converges faster than PSO-SVM, although its maximum fitness is smaller than that of PSO-SVM. In short, IGWO significantly improves both the convergence rate and the fitness of the SVM classifier compared with the original GWO, PSO, and GA. The main reason may be the good initial population produced by the PSO algorithm.
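The termination condition mentioned above, stopping once the best fitness has not improved for a preset number of consecutive iterations, can be sketched as follows. The function name, the `max_iter` cap, and the exact way stalled iterations are counted are assumptions for illustration and may differ from the counting used in the experiments.

```python
def run_until_stalled(evaluate_iteration, patience=200, max_iter=1000):
    """Run an optimizer step by step; stop after `patience` stalled iterations.

    `evaluate_iteration(t)` is a hypothetical callback that performs iteration t
    and returns the best fitness found in that iteration.
    """
    best_fitness = float("-inf")
    stalled = 0
    iterations_run = 0
    for iteration in range(1, max_iter + 1):
        iterations_run = iteration
        fitness = evaluate_iteration(iteration)
        if fitness > best_fitness:
            best_fitness = fitness
            stalled = 0          # progress made, reset the counter
        else:
            stalled += 1         # no improvement in this iteration
        if stalled >= patience:
            break
    return best_fitness, iterations_run


# Toy fitness curve that improves for 5 iterations and then plateaus.
toy_curve = lambda t: min(t, 5) / 5.0
print(run_until_stalled(toy_curve, patience=3))
```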
Table 5 presents the frequencies of the features selected by GA-SVM, PSO-SVM, GWO-SVM, and IGWO-SVM within the 10-fold CV procedure. For all four methods, the average frequencies of six features, namely the 1st feature ($F_{1}$), the 3rd feature ($F_{3}$), the 7th feature ($F_{7}$), the 8th feature ($F_{8}$), the 10th feature ($F_{10}$), and the 11th feature ($F_{11}$), are more than five. This indicates that the four methods are highly consistent in picking out the most important features and suggests that these features deserve more attention in the decision-making process. The frequency of each feature selected by the four methods is shown in Figure 7. It can be observed from the figure that the frequencies of $F_{1}$, $F_{3}$, $F_{7}$, $F_{8}$, $F_{10}$, and $F_{11}$ selected by GWO-SVM and IGWO-SVM are higher than those of the other two counterparts, while the frequencies of $F_{2}$, $F_{4}$, $F_{5}$, $F_{6}$, and $F_{9}$ chosen by GWO-SVM and IGWO-SVM are lower than those of the other two counterparts. Between GWO-SVM and IGWO-SVM, IGWO-SVM performs even better. From the table, the features chosen most frequently by the IGWO-SVM approach are, ranked from highest to lowest, $F_{3}$, $F_{1}$, $F_{7}$, $F_{8}$, and $F_{10}$. On the one hand, this indicates that GWO has a better capability to remove redundant and irrelevant features from the data. On the other hand, it also suggests that the IGWO strategy can pick out the most discriminative features from the dataset compared with the other three counterparts.
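Selection frequencies of the kind reported in Table 5 can be tallied from the binary feature masks chosen in each fold. The sketch below uses made-up masks for three features over four folds purely for illustration; the function name and the "more than half of the folds" cut-off are assumptions.

```python
def selection_frequency(masks):
    """Count, per feature, in how many folds its flag was set to 1."""
    n_features = len(masks[0])
    return [sum(mask[j] for mask in masks) for j in range(n_features)]


# Hypothetical masks: one row per fold, one column per feature.
fold_masks = [
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 1],
]
freq = selection_frequency(fold_masks)
# Features picked in more than half of the folds (1-based labels).
frequent = [f"F{j + 1}" for j, c in enumerate(freq) if c > len(fold_masks) / 2]
print(freq, frequent)
```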
From the above analysis, we find that the most important features include whether students come from the south of Zhejiang ($F_{3}$), gender ($F_{1}$), third classroom activities, which comprise participation in after-class graphic design activities ($F_{7}$) and participation in after-class video production activities ($F_{8}$), and the basic course results of the classified major, which comprise the scores of the basic course relevant to graphic design ($F_{10}$) and the scores of the basic course relevant to video production ($F_{11}$); the influence of these characteristics on the choice of classified major is relatively prominent. Previous data show that whether students come from the south of Zhejiang has a prominent impact on the choice of classified major, because most students return to their hometown for work after graduation and take the local employment resources for each major into account. In Wenzhou, Taizhou, Lishui, Jinhua, and Quzhou, where the electronic commerce and printing industries are prosperous, students are more inclined to choose graphic design. In Hangzhou, Shaoxing, Jiaxing, Huzhou, Zhoushan, Ningbo, and other places where the film and animation industries are prosperous, students are more apt to choose video production. There are great differences in major adaptability between male and female students, which lead to differences in employment. Because shooting requires more physical effort and film and television equipment is relatively heavy, female students are less inclined to choose video production when selecting the second major, while male students adapt more readily. The third classroom mainly includes participation in creative studios, professional skills competitions, and student projects. Unlike students' subjective understanding of their own interests, such participation objectively reflects students' major interests. Thus it is also an important factor when students select the second major. The scores of basic courses associated with the specialized direction also have a large impact on the selection of the second major. These scores objectively reflect students' major knowledge, so they can better guide students to choose the most suitable major according to their own knowledge structure. From previous students who have chosen a second major, we know that students who achieved higher scores in basic courses such as graphic design and the basis of digital graphic form are more inclined to choose the graphic design major, while students who achieved high scores in basic courses such as basic video and basic photography are more inclined to choose the video production major.

Conclusions and Future Work
In this study, we have examined an improved SVM for effectively predicting students' major selection. To exploit the maximum potential of SVM, an improved GWO strategy was established to search for the optimal feature subset for classification. The experimental results demonstrate that the developed approach achieves better classification performance than the other machine learning approaches compared, in terms of ACC, AUC, sensitivity, and specificity. Therefore, the developed intelligent system can serve as a promising alternative decision support system for students' second major selection. In future work, we plan to implement our approach in a parallel manner to further reduce the computational cost. In addition, more data samples should be collected to improve the prediction performance of the proposed system.

$F_{11}$: The scores of the basic course relevant to video production. Average scores of the video production courses, such as 2D animation design and video production basics.

Figure 3: Contour diagram obtained by the grid search for the SVM classifier on the 1st fold data.

Figure 4: Training accuracy surface of SVM with parameters obtained by the grid search for the 1st fold data.

Figure 5: The classification performance obtained by the five methods in terms of ACC, AUC, sensitivity, and specificity.

Figure 6: The average results of best fitness during the training stage in one run of 10-fold CV obtained by the four methods.

Figure 7: The frequency of each feature chosen by IGWO-SVM over the 10-fold CV procedure.
Calculate the fitness of grey wolves with selected features;
alpha = the grey wolf with the first maximum fitness;
beta = the grey wolf with the second maximum fitness;
delta = the grey wolf with the third maximum fitness;

the students decided on the graphic design major, and 207 students selected the video production. The 11 factors include gender, type of college entrance applications, whether they came from the south of Zhejiang, whether they were science students, whether they volunteered to major in digital media, basic course scores, participation in after-class activities,

Figure 2: Flowchart of the proposed IGWO-SVM based system.

Table 1: Description of the dataset.

Table 2: The settings of parameters.

Table 3: The detailed results obtained by SVM.

Table 5: Average frequencies of the selected features by the four methods.