Improved Instance Selection Methods for Support Vector Machine Speed Optimization

Support vector machine (SVM) is one of the most widely used algorithms for pattern recognition and classification tasks. It has been used successfully to classify both linearly separable and nonlinearly separable data with high accuracy. However, in terms of classification speed, SVMs are outperformed by many machine learning algorithms, especially when massive datasets are involved. SVM classification time scales linearly with the number of support vectors, and the number of support vectors grows with dataset size. Hence, SVM classification time can be greatly reduced if the SVM is trained on a reduced dataset. Instance selection techniques are among the most effective means of reducing SVM training time. In this study, two instance selection techniques suitable for identifying relevant training instances are proposed. The techniques are evaluated on a dataset containing 4000 emails, and the results obtained are compared to those of other existing techniques. The results reveal a substantial improvement in SVM classification speed.


Introduction
Support vector machine (SVM) has performed remarkably well in classification and pattern recognition problems. Its high classification accuracy makes it one of the most preferred machine learning (ML) algorithms. However, SVM has high classification complexity, which scales linearly with the number of support vectors (SVs). The number of SVs increases with the number of dataset instances. That is, a massive dataset produces many SVs and consequently reduces SVM classification speed, making it unsuitable for real time systems. SVM training time is $O(n^2)$, where $n$ is the number of training instances [1,2]. Instance selection techniques have been successfully used to improve SVM classification speed and training complexity. These techniques are used to minimize SVM training time by extracting relevant instances from the training set. The extracted instances (also called SVs) are instances close to the decision boundary. Eliminating instances that are non-SVs does not have a negative impact on the SVM training result [1]. This study presents two instance selection techniques and applies them to phishing email classification. A brief introduction to instance selection and phishing is presented next.
(1) Instance Selection. Numerous ML related problems require automatic classification of new instances. Prior to classification, a classifier is typically trained on a set of instances, called the training set. Generally, training datasets contain redundant instances; hence, removal of these instances reduces the computational cost of the classifier. Instance selection techniques are designed to remove irrelevant instances from a dataset. They aim at reducing the training time of a classifier. Instance selection is particularly useful for instance-based classifiers, where classification of one instance involves the use of an entire training set [3]. Instance selection can start with either an empty set (incremental technique) or a training set (decremental technique) [3]. For the incremental technique, instances are added into an empty set, and for the decremental technique, instances are removed from the training set [3]. Instance selection techniques can be classified into two: wrapper and filter [3]. Wrapper based instance selection depends on the accuracy achieved by a classifier, whereas filter-based instance selection techniques do not depend on a classifier [3]. Filter-based instance selection techniques are typically faster than wrapper based techniques [3]. In this study, two filter-based instance selection techniques are developed and applied to email classification. Performances of these techniques are evaluated, and they yielded excellent results.
(2) Phishing. Phishing is an effort to acquire sensitive information from users by electronic means, generally for fraud. Typically, phishing is perpetrated by creating a replica website of a legitimate organization. Phishing is one of the major threats encountered by many online users in recent times. Since the advent of electronic commerce in 1994, phishing has advanced at a fast pace [4]. Undoubtedly, high patronage of online businesses is one of the primary causes of the rapid increase in online fraud. In 2014, 1,198 companies lost 179 million US dollars to email scam [5]. Between October 2013 and August 2015, 7000 companies in the USA lost about 750 million US dollars to phishing [5]. Moreover, between 2014 and 2016, total loss to email scam (by organizations) is estimated to be 2.3 billion US dollars [5]. The urgent need for a robust phishing detection system cannot be overemphasized. A secure phishing detection system should be capable of identifying and protecting users from known and novel phishing attacks [6]. Many solutions have been proposed in literature to handle phishing; however, ML-based techniques are among the few that have yielded high classification accuracy, because of their ability to detect both existing and emerging fraudulent attacks. This paper proposes two improved SVM-based solutions for phishing email classification.

Related Works
Many instance selection techniques have been proposed in literature to reduce the computational complexity of SVM. Some proposed wrapper and filter-based instance selection techniques are presented in this section.

Wrapper Based Instance Selection Techniques.
Wrapper techniques perform instance selection using a classification model [3]. During instance selection, datasets are divided into subsets, and each subset is used to train a model. Afterwards, each model is tested on a separate subset. Furthermore, the weight for each subset is evaluated by calculating the number of correctly classified instances. Finally, the subset with the best weight is selected and used to build the main model. Although wrapper based techniques typically select optimal subsets, the selection process is time consuming. Some proposed wrapper based instance selection techniques for SVM speed optimization are presented below.
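The wrapper procedure described above can be illustrated with a minimal sketch. It is not taken from any of the cited works: the toy nearest-centroid classifier stands in for SVM, the strided split into subsets is an assumption, and the names `wrapper_select`, `fit_centroids`, and `predict` are hypothetical.

```python
from statistics import mean


def sqdist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def fit_centroids(X, y):
    """Toy stand-in for a classifier: one centroid per class label."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Predict the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: sqdist(centroids[label], x))


def wrapper_select(X, y, X_val, y_val, n_subsets):
    """Train one model per candidate subset and keep the best-scoring subset.

    The weight of a subset is the number of validation instances that the
    model trained on that subset classifies correctly.
    """
    best_subset, best_weight = None, -1
    for start in range(n_subsets):
        # Assumed split: every n_subsets-th instance goes to the same subset.
        subset = list(range(start, len(X), n_subsets))
        model = fit_centroids([X[i] for i in subset], [y[i] for i in subset])
        weight = sum(predict(model, x) == t for x, t in zip(X_val, y_val))
        if weight > best_weight:
            best_subset, best_weight = subset, weight
    return best_subset, best_weight
```

Because a model is trained and tested for every candidate subset, the cost grows with the number of subsets, which is why wrapper selection tends to be slow.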
García et al. [7] introduced an evolutionary algorithm (EA) based technique for imbalanced classification in exemplar-based learning. In the study, authors calculated the distance of each piece of data to different exemplars and used EA to select the best exemplar. The selected exemplar was then used for training. In another work, Cano et al. [8] performed a study on the performance of EA-based instance selection techniques. In the study, authors focused on four EA models and compared their performance to non-EA algorithms. Results obtained from the study revealed that EA performed better compared to non-EA algorithms. Li et al. [9] proposed an SVM-based instance selection technique. In the study, authors combined SVM and a KNN-based instance selection technique (called DROP2 [10]). SVM was used to select SVs, and DROP2 was used to further reduce the selected SVs. The resultant dataset was then used to train SVM. Garain [11] proposed an instance selection technique based on Artificial Immune System (AIS). In the study, authors used the idea of AIS to select the fittest set of instances from a dataset. Zhang and Sun [12] proposed a tabu search based technique for instance selection. In the study, different subsets were selected and tabu search was applied to each subset. Each subset was evaluated, and the subset that produced the best classification accuracy was selected.

Filter-Based Instance Selection Techniques.
Filter-based instance selection techniques perform instance selection using a choice function [3]. Instance selection is performed based on scores assigned to instances. Unlike wrapper-based techniques, instance subsets produced by filter-based techniques are usually not tailored to a certain type of classification model; they are more general. Some filter-based techniques are discussed next.
Riquelme et al. [13] proposed an instance selection technique for selecting boundary instances. Authors designed a selection rule that discards weak instances far from a boundary. Weakness of an instance is determined by weakness of all attributes which describe the instance. That is, weakness(I) = $n$, where $n$ is the number of features describing instance I. Lyhyaoui et al. [14] proposed a clustering-based instance selection technique for obtaining boundary instances in multiclass datasets. In the study, authors obtained boundary instances by selecting cluster centers close to opposite classes. In another work, De Almeida et al. [15] proposed a clustering-based technique using the $k$-means algorithm. The technique was designed with an assumption that training vectors close to a separating margin are prospective SVs, and training vectors far from the margin are likely non-SVs. In the study, authors divided the training dataset into different clusters. Afterwards, training vectors in clusters containing only one class were discarded (only their cluster centers were considered) and training vectors in clusters containing more than one class were selected for training. Selection was based on the assumption that clusters with multiple classes possibly contain SVs, because they are near a separating margin. In another study, Chen et al. [16] proposed a clustering-based instance selection technique. In the study, authors used a clustering algorithm to obtain cluster centers of instances in a positive class. Afterwards, the cluster centers were used as reference points to select boundary instances. The algorithm was designed on the assumption that negative instances close to cluster centers of the positive class and positive instances far from cluster centers of the positive class are close to the boundary. In other words, positive instances close to a boundary contribute less to the decision surface and negative instances close to a boundary contribute more to the decision surface.
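The cluster-purity rule attributed above to De Almeida et al. [15] can be sketched as follows. This is a minimal illustration under stated assumptions: cluster assignments are taken as given (for example, from a prior $k$-means run), and the function name `select_by_cluster_purity` is hypothetical rather than part of the cited work.

```python
from collections import defaultdict


def select_by_cluster_purity(X, y, cluster_ids):
    """Keep every vector from mixed-class clusters; replace each
    single-class cluster by its center.

    X: list of feature vectors, y: class labels,
    cluster_ids: cluster index of each vector (assumed precomputed).
    Returns a list of (vector, label) pairs to train on.
    """
    groups = defaultdict(list)
    for i, c in enumerate(cluster_ids):
        groups[c].append(i)

    selected = []
    for members in groups.values():
        labels = {y[i] for i in members}
        if len(labels) > 1:
            # Mixed cluster: likely near the separating margin, keep all.
            selected.extend((X[i], y[i]) for i in members)
        else:
            # Pure cluster: likely far from the margin, keep only its center.
            columns = zip(*(X[i] for i in members))
            center = [sum(col) / len(members) for col in columns]
            selected.append((center, labels.pop()))
    return selected
```

For five vectors grouped into one pure class-0 cluster, one mixed cluster, and one pure class-1 cluster, the sketch returns four training pairs: two centers and the two vectors of the mixed cluster.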
Panda et al. [1] proposed an instance selection technique capable of selecting data instances close to a decision boundary. The selected boundary instances are believed to be SVs. The technique consists of two stages. The first stage is responsible for identifying a set of nearest neighbors for all instances in a dataset, and the second stage is responsible for selecting instances close to a boundary. Authors developed a scoring function that assigns high scores to instances closer to a decision boundary. In another study, García et al. [17] introduced an instance selection algorithm based on memetic algorithm. Memetic algorithm combines EA and local search. In the study, authors designed the local search to select relevant instances and also improve classification accuracy.
In this study, for comparison purposes, two of the reviewed instance selection techniques were implemented and applied to phishing emails. The two techniques (Chen et al. [16] and Panda et al. [1]) and their results are presented next.
As aforementioned, Panda et al. [1] designed a scoring function for selecting instances close to a decision boundary. The scoring function is given in

$$\mathrm{Score}(x_j, x_i) = \exp\left(-\frac{\lVert x_i - x_j \rVert_2^2 - \xi_i^2}{\gamma}\right), \quad (1)$$

where $\mathrm{Score}(x_j, x_i)$ denotes the score accorded to $x_j$ by $x_i$, $\xi_i^2$ is the squared distance from $x_i$ to the closest instance of the opposite class on its neighborhood list, and $\gamma$ is the mean of $\lVert x_i - x_j \rVert_2^2 - \xi_i^2$. During implementation in this study, squared Euclidean distance was used for distance computation. Pseudocode for the scoring function is shown in Algorithm 1, and the classification steps are as follows [1]:
(i) Identify all nearest neighbors (NN) for each instance in the dataset.
(ii) Compute exponential decay score for each instance and its NN belonging to opposite class.
(iii) Determine the score for each instance.
(iv) Based on the scores, select boundary instances.
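Steps (i) to (iv) can be sketched in a few lines. The sketch assumes an exponential-decay score of the form $\exp(-(\lVert x_i - x_j \rVert_2^2 - \xi_i^2)/\gamma)$, with $\gamma$ taken as the mean of the bracketed gap over all scored pairs; the function names, the handling of a zero $\gamma$, and the top-`m` selection rule are illustrative assumptions, not the reference implementation of [1].

```python
import math


def sqdist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def boundary_scores(X, y, k):
    """Score each instance by how strongly opposite-class neighbors vote for it."""
    n = len(X)
    pairs = []  # (voter i, scored neighbor j, d_ij - xi_i^2)
    for i in range(n):
        # Step (i): k nearest neighbors of instance i.
        others = (j for j in range(n) if j != i)
        neigh = sorted(others, key=lambda j: sqdist(X[i], X[j]))[:k]
        opposite = [(j, sqdist(X[i], X[j])) for j in neigh if y[j] != y[i]]
        if not opposite:
            continue  # no opposite-class neighbor, so i casts no score
        xi2 = min(d for _, d in opposite)  # closest opposite-class neighbor
        pairs.extend((i, j, d - xi2) for j, d in opposite)

    scores = [0.0] * n
    if not pairs:
        return scores
    gamma = sum(gap for _, _, gap in pairs) / len(pairs)
    for _, j, gap in pairs:
        # Steps (ii)-(iii): accumulate the exponential decay score on j.
        # Assumption: when gamma is 0 every gap is 0, so each vote counts 1.
        scores[j] += math.exp(-gap / gamma) if gamma > 0 else 1.0
    return scores


def select_boundary(X, y, k, m):
    """Step (iv): return the indices of the m highest-scoring instances."""
    scores = boundary_scores(X, y, k)
    order = sorted(range(len(X)), key=lambda j: scores[j], reverse=True)
    return sorted(order[:m])
```

On six points spaced evenly along a line, with the first three in one class and the last three in the other, only the two points that face each other across the class boundary receive a nonzero score.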
Result for the KNN-based technique is shown in Table 1. The table shows the result for varying values of $k$ and varying numbers of subsets (i.e., boundary instances). The result reveals an improvement in SVM classification speed. Also, the clustering-based technique proposed by Chen et al. [16] was implemented. The algorithm for the technique is shown in Algorithm 2, and the classification steps are as follows [16]:
(i) Select instances from dataset, $D$, for positive class, PC.
(ii) Select instances for negative class, NC, where NC = $D$ − PC.
(iii) Apply clustering to positive class to obtain cluster centers (or means).
(iv) Select boundary instances using obtained cluster centers. To achieve this, do the following.
(v) For each cluster center,
(a) compute distance between cluster center and selected positive instances;
(b) sort distance and remove positive instances that are close to the boundary;
As shown in Table 2, the algorithm improved SVM classification speed without degrading classification accuracy.

Proposed Instance Selection Techniques
This section presents two instance selection techniques proposed in this study. The first technique is based on firefly algorithm, and the second technique is based on edge detection in image processing. Both techniques were evaluated on a dataset consisting of 3500 ham emails and 500 phishing emails. The ham emails were obtained from SpamAssassin [18] and the phishing emails were obtained from https://monkey.org/ [19]. The dataset contains a higher proportion of ham emails because, in the real world, mail users receive more legitimate emails than phishing emails. All the emails were labelled and evenly distributed into 10 folders. Afterwards, 10-fold cross validation was performed. A brief introduction to firefly algorithm (FFA) and edge detection is given in Sections 3.1 and 3.2, respectively.

Dataset Processing and Feature Extraction.
Prior to classification, 16 features were first extracted from emails in the dataset. Features extracted are similar to the features used in one of our previous studies [20]. Furthermore, the extracted features were normalized, and the information gain (IG) for all the features was calculated. Afterwards, the best nine features were selected and converted to the input format required by libSVM [21], the SVM library used in this study. During classification, Gaussian transformation is used to scale down the feature vectors, to ensure that each vector has a mean of zero and a unit variance. Firefly parameters used in this study are similar to the parameters suggested by Yang [22]. Also, the parameter selection technique used in this study is similar to the technique recommended by Hsu et al. [23]. More details are provided in Tables 10 and 11.
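As an illustration of ranking features by information gain, the following minimal sketch computes IG for discrete-valued features and returns the indices of the top-ranked ones. It assumes feature values have already been discretized (continuous features would first need binning, which the text does not specify), and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG = H(labels) - sum over values v of P(v) * H(labels | feature = v)."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def top_features(rows, labels, k):
    """Return the column indices of the k features with the highest IG."""
    columns = list(zip(*rows))
    gains = [information_gain(col, labels) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: gains[j], reverse=True)
    return ranked[:k]
```

A feature that perfectly separates two equally frequent classes has an IG of 1 bit, while a constant feature has an IG of 0, so a call such as `top_features(rows, labels, 9)` would keep the nine most informative columns.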

Firefly Algorithm.
FFA is a nature inspired (NI) algorithm, developed by Yang [24]. It is based on the flashing behavior of fireflies [25]. Most firefly species produce short flashes of light at regular intervals to attract mating partners and prey and to send warning signals to predators [24]. Firefly light intensity is inversely proportional to the square of the distance between fireflies. Additionally, as distance increases, light is absorbed in the atmosphere, and light intensity decreases [24]. The flashing light can be formulated such that it is associated with the value of the objective function. FFA has many variants; however, this study focuses on the original algorithm, formulated using three idealized rules as follows [24]:
(1) Fireflies are unisex; hence they can be attracted to each other irrespective of their sex.
(2) Firefly attractiveness and brightness are proportional, and they decrease with respect to distance. Therefore, a brighter firefly attracts less bright fireflies. Also, fireflies move randomly if they are of equal light intensity.
(3) Firefly brightness is determined by the objective function landscape.
In the FFA-based instance selection technique proposed in this study, fireflies are evaluated using the objective function defined in (2), and the best firefly is selected and used to train SVM. Each firefly consists of a binary array of $n$ instances (called instance mask), where 1 indicates that an instance is selected, and 0 indicates otherwise. During the experiment, the instance mask for each firefly is randomly initialized with 0s and 1s.
Afterwards, the objective function for each firefly is evaluated and the global best is saved. Furthermore, fireflies are moved from one position to another, their attractiveness is calculated, and their fitness value is updated. The process is repeated until a predefined number of generations is reached. Finally, the best firefly is selected, its instance mask is processed, and instances with the value of 1 are selected and used to train SVM. A constraint is added to ensure that at least $k$ instances are selected for training, where $k$ is user defined. Hence, if the total number of selected instances ($s$) is less than $k$, $d$ additional instances are randomly selected, where $d = k - s$. This constraint is added to eliminate the possibility of having zero selected instances.
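The instance mask and the minimum-selection constraint can be sketched as below. This is an illustrative sketch only: the helper names `random_mask` and `enforce_minimum` and the parameter `min_selected` are assumptions, and the firefly movement and objective function are deliberately left out.

```python
import random


def random_mask(n_instances, rng):
    """Randomly initialize a binary instance mask (1 = instance selected)."""
    return [rng.randint(0, 1) for _ in range(n_instances)]


def enforce_minimum(mask, min_selected, rng):
    """Return a copy of mask with at least min_selected bits set to 1.

    If fewer instances are selected, the shortfall is filled by switching
    on randomly chosen unselected instances.
    """
    repaired = list(mask)
    selected = sum(repaired)
    if selected >= min_selected:
        return repaired
    unselected = [i for i, bit in enumerate(repaired) if bit == 0]
    # Never ask for more instances than are available.
    shortfall = min(min_selected - selected, len(unselected))
    for i in rng.sample(unselected, shortfall):
        repaired[i] = 1
    return repaired
```

Applying the repair to the best firefly's mask before training guarantees that the training set is never empty, whatever mask the search converges to.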

Objective
where TNI is the size of the instance mask and TS is the total number of selected instances.

Edge Detection.
Edge detection in image processing is a technique used to identify object boundaries in images [26]. Object boundaries are points in images with a sharp change in image brightness [26]. Generally, images contain some quantity of redundant data that requires pruning for effective classification. Hence, to reduce computational complexity, edge detection is an essential preprocessing step [27]. Edge detection is applied to images with the aim of identifying important features, removing less relevant information, and consequently reducing the image size. Generally, edge detection is used for segmentation of images, feature extraction, and feature detection in image processing, computer vision, and machine vision [26][27][28]. Edge detection conserves essential structural properties of images and computer space [27]. Edge detection algorithms include Canny algorithm, Sobel algorithm, and Roberts algorithm. Figure 1 shows an example of an image and its detected edges.
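To make the idea of a sharp change in brightness concrete, the sketch below applies the standard 3×3 Sobel kernels to a small grayscale image stored as a list of rows. It is a minimal illustration, not the procedure used to produce Figure 1; border pixels are simply left at zero, and the function name `sobel_magnitude` is hypothetical.

```python
def sobel_magnitude(img):
    """Return the Sobel gradient magnitude of a 2-D grayscale image.

    img is a list of equal-length rows of numbers. Border pixels, where the
    3x3 window does not fit, are left at 0.0 (an assumption of this sketch).
    """
    kernel_x = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # horizontal change
    kernel_y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # vertical change
    height, width = len(img), len(img[0])
    out = [[0.0] * width for _ in range(height)]
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            gx = gy = 0
            for i in range(3):
                for j in range(3):
                    pixel = img[r - 1 + i][c - 1 + j]
                    gx += kernel_x[i][j] * pixel
                    gy += kernel_y[i][j] * pixel
            out[r][c] = (gx * gx + gy * gy) ** 0.5
    return out
```

A uniform image yields zero everywhere, whereas a vertical step from dark to bright columns yields large values along the step, which is the kind of boundary an edge detector keeps.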
The concept of edge detection in image processing is similar to the concept of boundary detection in SVM classification. Edge detection aims to select objects located at boundary positions, and boundary detection algorithms aim to select instances (also called SVs) close to a decision boundary. In this study, an instance selection technique based on edge detection is proposed.

Edge Instance Selection Algorithm.
This study proposes an instance selection technique called Edge Instance Selection Algorithm (EISA). EISA borrows its idea from edge detection in image processing. Given a set of training instances, EISA identifies an edge instance and selects $k$ instances close to it. EISA consists of two main stages: distance computation stage and edge instance selection stage. In the first stage, EISA computes the squared Euclidean distance between each instance and all other instances in the dataset. Furthermore, in this stage, for each instance $x_i$ in the dataset, based on the proximity of other instances to $x_i$, EISA votes for a corresponding instance $x_j$ (i.e., edge instance), where $x_j$ is the instance farthest from $x_i$. In the second stage, the edge instance with the highest vote is first selected. Afterwards, the $k$-nearest neighbors of the voted edge instance are computed and used to train the SVM model. Algorithm 4 shows the full algorithm of EISA. Some experiments were performed to test the efficiency of EISA, and the results reveal that EISA significantly improved SVM classification speed.

[29]. The experiment was performed with the aim of comparing the performance of the proposed techniques to other instance selection techniques in literature. In the experiment, EISA, FFA IS, and FFA Clus were compared to five other instance selection techniques, namely, PSC [30], DROP 3 [10], DROP 5 [10], GCNN [31], and POCNN [32]. Results for the experiment are shown in Table 7 and Figures 2 and 3.

Statistical analysis of the results was performed using a one-sample $t$-test. The goal of the statistical analysis is to determine whether it can be concluded, with a 95% confidence level, that the proposed techniques perform better (in terms of classification speed and accuracy) than PSC, DROP 3, DROP 5, GCNN, and POCNN. As aforementioned, 10-fold cross validation was performed, hence the use of the $t$-test. Also, since the number of samples is 10, the critical value from the $t$-test table is 2.2622. Results of the analysis are reported in Tables 8 and 9. As shown in Table 8, in terms of classification accuracy, there is a statistically significant difference between EISA and PSC. There is also a statistically significant difference between FFA and PSC, DROP 3, DROP 5, GCNN, and POCNN. Moreover, there is a statistically significant difference between FFA Clus and PSC, DROP 3, DROP 5, GCNN, and POCNN. Furthermore, as shown in Table 9, in terms of classification speed, there is a statistically significant difference between EISA and DROP 3 and DROP 5. There is also a statistically significant difference between FFA and PSC, DROP 3, DROP 5, GCNN, and POCNN. Moreover, there is a statistically significant difference between FFA Clus and PSC, DROP 3, DROP 5, GCNN, and POCNN.

Key (FFA parameters): $\alpha$ = alpha, $\gamma$ = gamma, $\beta_0$ = beta, plus the number of generations.

... produced balanced speed-accuracy trade-offs. In the future, the two proposed techniques will be tested on other ML algorithms, and more NI-based instance selection techniques will be developed and tested.
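The two stages of EISA described earlier in this section (farthest-instance voting, then nearest-neighbor selection around the most-voted edge instance) can be sketched as follows. This is a minimal reading of the prose description rather than a transcription of Algorithm 4; the function name `eisa_select`, the tie-breaking behavior, and the choice to return the edge instance separately from its neighbors are assumptions.

```python
from collections import Counter


def sqdist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def eisa_select(X, k):
    """Return (edge_index, neighbor_indices) for a list of feature vectors X.

    Stage 1: every instance votes for the instance farthest from it.
    Stage 2: the most-voted instance is the edge instance, and its k nearest
    neighbors are selected for training.
    """
    n = len(X)
    dist = [[sqdist(X[i], X[j]) for j in range(n)] for i in range(n)]

    votes = Counter()
    for i in range(n):
        farthest = max(range(n), key=lambda j: dist[i][j])
        votes[farthest] += 1

    # Ties are broken by first occurrence (an assumption of this sketch).
    edge = votes.most_common(1)[0][0]
    others = (j for j in range(n) if j != edge)
    neighbors = sorted(others, key=lambda j: dist[edge][j])[:k]
    return edge, neighbors
```

Computing all pairwise distances costs quadratic time in the number of instances, so the sketch is only intended for small illustrative inputs.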

Conclusion and Future work
(c) compute distance between cluster centers and negative class;
(d) add negative instances that are close to the boundary.
(vi) End For.
(vii) Use selected positive instance for training.
Key: NFF: number of fireflies, NI: number of instances, GB: global best, APA: average prediction accuracy, FP: false positive, FN: false negative, R: recall, Pr: precision, FM: F-measure, and T: time.

3.2.1. FFA-Based Instance Selection Technique.
This study introduces an instance selection technique (called FFA IS) based on FFA. FFA IS is designed with an objective of minimizing the number of instances used for training. Given a set of training instances, fireflies are evaluated using the objective function defined in (2).

Table 4: Result for FFA IS.

Table 11: FFA parameters used for evaluations.