A Core Set Based Large Vector-Angular Region and Margin Approach for Novelty Detection

A large vector-angular region and margin (LARM) approach is presented for novelty detection based on imbalanced data. The key idea is to construct the largest vector-angular region in the feature space to separate the normal training patterns while maximizing the vector-angular margin between the surface of this optimal vector-angular region and the abnormal training patterns. To improve the generalization performance of LARM, the vector-angular distribution is optimized by maximizing the vector-angular mean and minimizing the vector-angular variance, which separates the normal and abnormal examples well. However, the inherent quadratic programming (QP) solver takes O(n³) training time and at least O(n²) space, which can be computationally prohibitive for large scale problems. Using a (1 + ε)- and (1 − ε)-approximation algorithm, a core set based LARM algorithm is proposed for fast training of the LARM problem. Experimental results on imbalanced datasets validate the favorable efficiency of the proposed approach for novelty detection.


Introduction
The task of novelty detection is to learn a model from the normal examples in the training patterns and then use it to classify test patterns. In real-world novelty detection applications, it is usually assumed that normal training patterns can be well sampled, while abnormal training patterns are severely undersampled due to expensive measurement costs or the infrequency of abnormal events. Therefore, only normal training patterns are used to build the detection model in most novelty detection algorithms. Generally, novelty detection may be seen as a one-class classification problem. Recently, novelty detection has gained much research attention in real-world applications such as network intrusion detection [1], jet engine health monitoring [2], medical data [3], and aviation safety [4, 5].
In this paper, kernel-based novelty detection algorithms, which are popular and have recently proved successful, are studied in depth. Various kernel-based novelty detection approaches have been proposed, such as the one-class support vector machine (OCSVM) [6] and support vector data description (SVDD) [7]. OCSVM was proposed by Schölkopf et al. [6]; to improve generalization ability, its novelty detection boundary separates the origin from the input samples with the maximal margin. The performance of OCSVM is very sensitive to its parameters, making it difficult to generalize to other applications [8].
SVDD was proposed by Tax and Duin [7], in which the minimal ball enclosing most of the training samples is constructed. A novelty point is assessed by determining whether a test point lies within this minimal ball. The margin between the closed boundary surrounding the positive data and that surrounding the negative data is zero, which gives the method poor generalization ability. A small sphere and large margin (SSLM) approach was proposed by Wu and Ye [9], in which the smallest hypersphere is constructed to surround the normal data while the margin from any outlier to this hypersphere is made as large as possible. An incremental weighted one-class support vector machine for mining streaming data was proposed by Krawczyk and Woźniak [10, 11], in which the weight of each object is modified according to its level of significance, and the shape of the decision boundary is influenced only by new objects that carry new and useful knowledge extending the competence of the classifier.

The support vector machine (SVM) is solved through a quadratic programming (QP) problem, which has the important computational advantage of avoiding local minima. However, solving the corresponding SVM problem with a naive QP solver takes O(n³) time and at least O(n²) space when the number of training patterns is n. Obviously, a naive QP solver cannot meet the practical requirements of novelty detection on large scale datasets. Tsang et al. proposed the core vector machine (CVM) [12, 13] as an approximation algorithm based on the minimum enclosing ball (MEB) for large scale problems. The key idea is that the QP problems of the corresponding SVMs can be equivalently viewed as MEB problems. By utilizing an approximation algorithm for the MEB problem from computational geometry, the time complexity of CVM is linear in the number of training patterns, and its space complexity is independent of the number of training patterns.
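As an illustrative sketch (not part of the paper itself), the (1 + ε)-approximate MEB idea underlying CVM can be realized with the classic Bădoiu–Clarkson core-set iteration: repeatedly pull the center toward the furthest point, which becomes the next core vector.

```python
import numpy as np

def approx_meb(points, eps=0.1):
    """(1+eps)-approximate minimum enclosing ball via the Badoiu-Clarkson
    core-set iteration: O(1/eps^2) iterations suffice for a (1+eps)
    radius guarantee."""
    c = points[0].astype(float).copy()
    n_iter = int(np.ceil(1.0 / eps ** 2))
    for t in range(1, n_iter + 1):
        d = np.linalg.norm(points - c, axis=1)
        far = int(np.argmax(d))            # furthest point joins the core set
        c += (points[far] - c) / (t + 1)   # shift the center toward it
    radius = np.linalg.norm(points - c, axis=1).max()
    return c, radius

pts = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 1.0], [1.0, -1.0]])
center, radius = approx_meb(pts, eps=0.1)  # exact MEB: center (1, 0), radius 1
```

Note that the number of iterations depends only on ε, not on the number or dimension of the points, which is the source of CVM's favorable complexity.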
As mentioned above, only normal training patterns are used to build the detection model in most novelty detection algorithms. In practical applications, however, it is difficult, but not impossible, to obtain a few abnormal training patterns. For instance, in machine fault detection, in addition to extensive measurements under normal working conditions, there may also be some measurements of faulty situations [14]. Recently, extensive and comprehensive research has been carried out in both academia and industry to solve the imbalanced novelty detection problem.
Kernel-based novelty detection on imbalanced data is studied in this paper. Suppose D = {(x_i, y_i)}, i = 1, ..., n, is a given training dataset with n examples, where x_i ∈ R^d is the i-th input instance, y_i ∈ {−1, +1} is the class label associated with instance x_i, D_maj ⊂ D is the set of majority training patterns with |D_maj| = n_1, D_min ⊂ D is the set of minority training patterns with |D_min| = n_2, and n_1 + n_2 = n. φ(·) is the feature mapping defined by a given kernel function k(·, ·). The length of the perpendicular projection of the training pattern φ(x_i) onto a vector v is expressed as ⟨v, φ(x_i)⟩, which reflects both the angular and the Euclidean distance between v and φ(x_i) in the Euclidean vector space. Following the definition in [15], ⟨v, φ(x_i)⟩ is called the vector-angular.
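Because v lives in the (possibly infinite-dimensional) feature space, the vector-angular ⟨v, φ(z)⟩ is evaluated in practice through the kernel trick whenever v = Σ_i β_i φ(x_i). A minimal sketch, where the coefficient vector β is a hypothetical placeholder rather than a quantity defined in the paper:

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    # RBF kernel k(a, b) = exp(-gamma * ||a - b||^2)
    return float(np.exp(-gamma * np.sum((np.asarray(a) - np.asarray(b)) ** 2)))

def vector_angular(beta, X, z, gamma=1.0):
    """<v, phi(z)> for v = sum_i beta_i phi(x_i), evaluated via the
    kernel trick as sum_i beta_i k(x_i, z)."""
    return sum(b * rbf(x, z, gamma) for b, x in zip(beta, X))
```

For example, with β = [1.0] and a single training point equal to z itself, the projection is k(z, z) = 1.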
In this paper, a large vector-angular region and margin (LARM) algorithm and its fast core set based training method are proposed for novelty detection with imbalanced training patterns. The main contributions of this paper lie in three aspects. Firstly, the boundary of SVM is determined only by the support vectors, and the distribution of the training data is not considered [16]; however, recent theoretical results have proved that data distribution information is crucial to generalization performance [17, 18]. The proposed algorithm aims to find an optimal vector v in the feature space for which the mean and the variance of the vector-angular are maximized and minimized, respectively. Therefore, the normal and abnormal examples are well separated when projected onto the optimal vector v, combining a large mean with a small variance. Secondly, the proposed LARM integrates one-class and binary classification ideas to tackle novelty detection on imbalanced data: it constructs the largest vector-angular region in the feature space to separate the normal training patterns and maximizes the vector-angular margin between this optimal vector-angular region and the abnormal data. Since the number of normal training patterns is sufficient, the largest vector-angular region can be constructed accurately, which minimizes the chance of rejecting normal examples. To achieve better generalization performance, the vector-angular margin between the surface of this optimal vector-angular region and the abnormal data is maximized. Thirdly, the core set based LARM algorithm is proposed for fast training of the LARM problem. The time complexity of core set based LARM is linear in the number of training patterns, and its space complexity is independent of that number.
The structure of this paper is organized as follows. Section 1 introduces the novelty detection technique and analyzes existing problems. Section 2 introduces the ν-support vector machine (ν-SVM), two-class SVDD, and the maximum vector-angular margin classifier (MAMC). Section 3 presents the proposed LARM for novelty detection and its fast training method based on core sets. Experimental results are shown in Section 4 and conclusions are given in Section 5.

ν-SVM, SVDD, and MAMC
2.1. ν-SVM. ν-SVM was proposed by Schölkopf et al. [19] to solve the binary classification problem; it uses the parameter ν to control the number of support vectors and to bound the classification errors. ν-SVM can be modeled as follows:

min_{w, b, ρ, ξ} (1/2)‖w‖² − νρ + (1/n) Σ_{i=1}^{n} ξ_i
s.t. y_i(wᵀφ(x_i) + b) ≥ ρ − ξ_i, ξ_i ≥ 0, ρ ≥ 0,

where w is the normal vector of the decision hyperplane, b is the bias of the classifier, ρ determines the margin, ξ = [ξ_1, ..., ξ_n]ᵀ is the vector of slack variables, and ν is a positive constant. ν-SVM obtains the optimal hyperplane wᵀφ(x) + b = 0 separating the two classes with the maximal margin 2ρ/‖w‖. To classify a testing instance z ∈ R^d, the decision function takes the sign of the optimal hyperplane: f(z) = sgn(wᵀφ(z) + b).
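As a hedged illustration (not the experimental setup of this paper), the ν-SVM above is available as NuSVC in scikit-learn; the snippet below is a minimal usage sketch on synthetic two-cluster data.

```python
import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=2.0, size=(40, 2))    # positive cluster around (2, 2)
X_neg = rng.normal(loc=-2.0, size=(40, 2))   # negative cluster around (-2, -2)
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 40 + [-1] * 40)

# nu upper-bounds the fraction of margin errors and
# lower-bounds the fraction of support vectors
clf = NuSVC(nu=0.1, kernel="rbf", gamma="scale").fit(X, y)
preds = clf.predict([[2.0, 2.0], [-2.0, -2.0]])
```

The dual role of ν is exactly the ν-property exploited later in this paper for parameter selection.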
2.2. Two-Class SVDD. Two-class SVDD constructs the smallest hypersphere enclosing the normal data while keeping the abnormal data outside, where R and c are the radius and the center of the hypersphere, ν_1 and ν_2 are two trade-off parameters that allow imbalanced datasets to be treated differently, and ξ = [ξ_1, ..., ξ_n]ᵀ is the vector of slack variables. A testing instance z ∈ R^d is judged by whether it lies inside the optimal hypersphere; hence, the decision function of two-class SVDD is f(z) = sgn(R² − ‖φ(z) − c‖²).

2.3. MAMC. MAMC was proposed by Hu et al. in 2012 [15]; it attempts to find an optimal vector v in the feature space based on the maximum vector-angular margin (see [15] for the explicit model), where v is the optimized vector, ρ is the vector-angular margin, ξ = [ξ_1, ..., ξ_n]ᵀ is the vector of slack variables, and τ and ν are two positive constants. To classify a testing instance z ∈ R^d, the decision function is defined as f(z) = sgn(Σ_{i=1}^{n} (1/n + α_i y_i / 2) k(x_i, z)).
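For the one-class setting discussed throughout this paper, scikit-learn's OneClassSVM implements the ν-parametrized OCSVM, which coincides with SVDD for stationary kernels such as the RBF. A minimal sketch on synthetic normal-only training data (illustrative, not the paper's experiments):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
X_normal = rng.normal(size=(200, 2))   # only normal training patterns

# For the RBF kernel, the OCSVM boundary is equivalent to an SVDD ball
occ = OneClassSVM(nu=0.05, kernel="rbf", gamma=0.5).fit(X_normal)

inside = occ.predict([[0.0, 0.0]])     # point in the dense region
outside = occ.predict([[8.0, 8.0]])    # point far from the training cloud
```

predict returns +1 for points accepted as normal and −1 for detected novelties.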

Core Set Based Large Vector-Angular Region and Margin
In this section, the LARM algorithm and its fast training method based on core sets are proposed for novelty detection with imbalanced data.

LARM.
To tackle the novelty detection problem on imbalanced data, both the distribution of the vector-angular and the maximization of the vector-angular margin are considered in this paper. Figure 1 illustrates the principle of LARM. Firstly, LARM finds an optimal vector v in the feature space that maximizes the vector-angular mean and minimizes the vector-angular variance simultaneously. Here, the vector-angular expresses the length of the projection of a training pattern φ(x_i) onto the optimal vector v. Therefore, the normal and abnormal examples are well separated when projected onto the optimal vector v, combining a large mean with a small variance.
Secondly, for the learning problem on imbalanced data, the largest vector-angular region in the feature space is constructed to separate the normal data. Since the number of normal training patterns is sufficient, the largest vector-angular region can be constructed accurately, which minimizes the chance of rejecting normal examples. Meanwhile, to achieve favorable generalization performance, the vector-angular margin between the surface of this optimal vector-angular region and the abnormal data is maximized.
According to [18], the optimal v* for problem (5) can be expanded over the training patterns as v = Xβ, where X = [φ(x_1), ..., φ(x_n)]. Hence Xᵀv = XᵀXβ = Kβ, where K = XᵀX is the kernel matrix. Problem (5) can then be reformulated in terms of β, where Q = 4(KᵀK − (Ky)(Ky)ᵀ)/n², q = −(Ky)/n, and K_:i is the i-th column of K.
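The matrices of this reformulation follow directly from the stated definitions; a small sketch assembling Q and q from a given kernel matrix K and label vector y:

```python
import numpy as np

def build_Q_q(K, y):
    """Assemble the quadratic term Q = 4(K^T K - (Ky)(Ky)^T)/n^2 and the
    linear term q = -(Ky)/n of the reformulated problem."""
    n = len(y)
    Ky = K @ y
    Q = 4.0 * (K.T @ K - np.outer(Ky, Ky)) / n ** 2
    q = -Ky / n
    return Q, q

K = np.eye(2)
y = np.array([1.0, -1.0])
Q, q = build_Q_q(K, y)   # Q = [[0, 1], [1, 0]], q = [-0.5, 0.5]
```

Note that Q combines second-order information (KᵀK, related to the vector-angular variance) with first-order information ((Ky)(Ky)ᵀ, related to the vector-angular mean).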

Dual Problem.
To handle problem (7) with its constraints, the Lagrangian function is constructed, where α = [α_1, ..., α_n]ᵀ and μ = [μ_1, ..., μ_n]ᵀ are the Lagrange multipliers. Setting the partial derivatives of the Lagrangian with respect to the primal variables to zero yields (9)–(13). Substituting (9)–(13) into (8) gives the dual form (14), omitting constants that do not influence the optimization, where H = YKQ⁻¹KY, p = −(He)/n, Q⁻¹ is the inverse of Q, and e = [1, ..., 1]ᵀ. The dual problem (14) is a QP problem with the same form as the dual of ν-SVM [19, 20]; therefore, it can be easily solved by the SMO algorithm in LIBSVM [21].
Suppose α* is the optimal solution of the dual problem (14). According to (13), v* can be expressed as in (15). To compute ρ² and d², two index sets S_1 and S_2 are considered, as in (16). According to the Karush-Kuhn-Tucker (KKT) conditions together with (11) and (12), the patterns in these sets have strictly positive multipliers and zero slack variables. Hence, setting m_1 = |S_1| and m_2 = |S_2|, ρ² and d² can be expressed as in (17) and (18).

3.1.3. Decision Function. Minimizing the cost function (5) makes the width of the vector-angular region ρ² and the vector-angular margin d² as large as possible. Meanwhile, the optimal vector v in the feature space is found, which separates the normal and abnormal examples well when they are projected onto v, combining a large mean with a small variance. Therefore, testing patterns can be classified in terms of the vector-angular between the vector v and the training patterns φ(x). The optimal separating hyperplane of SVM, wᵀφ(x) + b = 0, lies at the middle of the margin; similarly, the separating hyperplane of LARM is defined at the center of the margin. Hence, for a testing instance z ∈ R^d, the decision function is expressed as in (19).

3.1.4. ν-Property. Let E_+ and E_− represent the numbers of margin errors of the normal and abnormal training patterns, and let SV_+ and SV_− denote the numbers of support vectors of the normal and abnormal training patterns, respectively. According to (9) and (10), the formulas in (20) can be obtained. By a proof similar to that of the ν-property in [19] and by making use of (20), the inequalities (21) can be obtained. The inequalities (21) indicate that ν_1 (respectively ν_2) is a lower bound on the fraction of support vectors in the normal (respectively abnormal) dataset and an upper bound on the fraction of margin errors in the normal (respectively abnormal) dataset. The ν-property of LARM is used for parameter selection in the following experiments.
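The center-of-margin decision rule can be sketched generically. Here `beta` stands for the kernel-expansion coefficients of v* and `threshold` for the center-of-margin offset derived from ρ² and d² in (17)–(19); both are hypothetical placeholders, since their exact closed forms come from the paper's equations rather than this sketch.

```python
import numpy as np

def larm_decide(beta, X_train, z, threshold, gamma=1.0):
    """Sign of <v, phi(z)> - threshold with v = sum_i beta_i phi(x_i);
    `threshold` stands in for the center-of-margin offset."""
    k = np.exp(-gamma * np.sum((np.asarray(X_train) - np.asarray(z)) ** 2, axis=1))
    return 1 if float(np.dot(beta, k)) >= threshold else -1
```

A pattern projecting beyond the threshold is accepted as normal (+1); otherwise it is flagged as a novelty (−1).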

Core Set Based LARM.
As mentioned above, the dual problem of LARM is a QP problem, so solving it directly takes O(n³) time and O(n²) space, which is computationally infeasible when the number of training patterns is large. Inspired by the core set based approximate MEB algorithms, a (1 + ε)- and (1 − ε)-approximation algorithm is utilized for fast training of the LARM problem, called core set based LARM. Firstly, core sets of training patterns are obtained by the (1 + ε)- and (1 − ε)-approximation algorithm to capture the vector-angular region of the normal and abnormal examples. The core set is a subset of the original training patterns, and the optimization problem can be approximately solved on the core set. Secondly, the LARM problem is solved by the SMO algorithm [22] on the obtained core set. According to [12, 13], the size of the core set is independent of both the number and the dimension of the training patterns; consequently, the time complexity is linear in the number of training patterns while the space complexity is independent of it. A schematic illustration of core set based LARM is shown in Figure 2.
Suppose S_t is the core set at the t-th iteration, v_t is the optimal vector in the feature space at the t-th iteration, r_t is the minimum distance between the center of the vector-angular margin and any point in the core set at the t-th iteration, and R_t is the maximum distance between the center of the vector-angular margin and any point in the core set at the t-th iteration. Given ε > 0, according to [12, 13], core set based LARM is trained as follows.

(i) Initialize the core set S_0 and set t = 0.

(ii) Solve the LARM problem on the current core set S_t to obtain v_t; terminate the loop if every training pattern falls within the vector-angular region [(1 − ε) × r_t, (1 + ε) × R_t].

(iii) Find z_a and z_b such that v_tᵀφ(z_a) is the furthest from the center of the vector-angular margin and v_tᵀφ(z_b) is the nearest to it. Set S_{t+1} = S_t ∪ {z_a, z_b}.

(iv) The distance between the center of the vector-angular margin and any point φ(z_ℓ) is expressed as in (22), where ρ_t² is the width of the vector-angular region at the t-th iteration, d_t² is the vector-angular margin at the t-th iteration, and the set {z_ℓ} consists of all training patterns outside the vector-angular region [(1 − ε) × r_t, (1 + ε) × R_t]. Computing (22) for all n training patterns takes O(|S_t|² + n|S_t|) = O(n|S_t|) time at the t-th iteration. When n is large, this cost is substantial, so the probabilistic speedup method [23] is used to accelerate the vector-angular computations in steps (ii) and (iii). Details of the time and space complexities can be found in [12, 13].

(v) Increase t by 1 and go back to step (ii).

(vi) Solve the LARM problem (14) on the final core set S_t.

(vii) Classify the test pattern by the decision function (19).
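The iterative structure above can be sketched as a generic core-set loop. Here `solve` and `distance` are hypothetical stand-ins for the LARM sub-problem solver and the margin-center distance (22); the toy solver used below (the mean of the core points) is purely illustrative and is not the paper's method.

```python
import numpy as np

def core_set_train(X, solve, distance, eps=0.5, max_iter=100):
    """Generic core-set loop in the spirit of steps (i)-(vi):
    grow the core set with the furthest/nearest violators until every
    point lies within [(1-eps)*r_t, (1+eps)*R_t]."""
    core = sorted(set(range(min(2, len(X)))))    # (i) small initial core set
    for _ in range(max_iter):
        model = solve(X[core])                   # (ii) solve on the core set
        d = np.array([distance(model, x) for x in X])
        r_t, R_t = d[core].min(), d[core].max()
        outside = (d < (1 - eps) * r_t) | (d > (1 + eps) * R_t)
        if not outside.any():                    # (ii) no violators: stop
            return model, core
        # (iii) add the furthest and nearest points to the core set
        core = sorted(set(core) | {int(np.argmax(d)), int(np.argmin(d))})
    return solve(X[core]), core                  # (vi) final solve

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_solve = lambda pts: pts.mean(axis=0)                 # illustrative solver
toy_distance = lambda m, x: float(np.linalg.norm(x - m))
model, core = core_set_train(X, toy_solve, toy_distance, eps=2.0)
```

Each sub-problem is solved only on the (small) core set, which is what decouples the space complexity from n.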

Experimental Results
The proposed core set based LARM is evaluated on twenty datasets, including both LIBSVM datasets [24] and UCI datasets [25]. Details of the datasets are listed in Table 1. The geometric mean accuracy is widely used for imbalanced data [14, 26, 27], since it considers the classification results on both the positive and the negative classes. To make the experimental results convincing, all the parameters of ν-SVM, SVDD, MAMC, and core set based LARM are selected by fivefold cross validation.
In all experiments, the Radial Basis Function (RBF) is taken as the kernel function, where γ is the kernel parameter of the RBF. For all the algorithms, γ is calculated following [12, 13] from K_{i,i} = x_iᵀx_i and the diagonal elements diag(K) of the kernel matrix K. For ν-SVM, parameter ν is searched in {0.1, 0.01, 0.001, 0.0001}.
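A common data-driven rule of the kind used in [12, 13] sets the RBF width to the average squared pairwise distance, which can be computed from the kernel diagonal via ‖x_i − x_j‖² = K_ii + K_jj − 2K_ij. The sketch below illustrates this general form; the exact constant used in the paper may differ.

```python
import numpy as np

def rbf_width_heuristic(X):
    """Average squared pairwise distance over the training set,
    computed from K = X X^T and its diagonal."""
    K = X @ X.T
    d = np.diag(K)
    # mean over (i, j) of K_ii + K_jj - 2 K_ij
    return float(2.0 * d.mean() - 2.0 * K.mean())

X = np.array([[0.0, 0.0], [2.0, 0.0]])
width = rbf_width_heuristic(X)   # squared distances {0, 4, 4, 0} -> mean 2.0
```

The RBF parameter is then derived from this width, making the kernel scale adapt to each dataset without a grid search.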

Parameters Influence.
There are five parameters in core set based LARM: ν, ν_1, ν_2, γ, and ε. To verify the influence of these parameters on the performance of core set based LARM, experiments on some representative datasets are performed. By fixing the other parameters, the influence of each parameter is studied in turn, as shown in Figures 3-7.
Figure 3 shows the influence of ν on the geometric mean accuracy and the number of core sets by varying ν from 10 to 100 while fixing ν_1, ν_2, γ, and ε at the values suggested by the cross validation described in Section 4.1. Figure 4 shows the influence of ν_1 by varying ν_1 from 0.001 to 0.01 while fixing ν, ν_2, γ, and ε in the same way. Figure 5 shows the influence of ν_2 by varying ν_2 from 0.001 to 0.01 while fixing ν, ν_1, γ, and ε. Figure 6 shows the influence of γ by varying γ from 2⁻⁹ to 2⁰ while fixing ν, ν_1, ν_2, and ε. Figure 7 shows the influence of ε by varying ε from 10⁻⁹ to 10⁻² while fixing ν, ν_1, ν_2, and γ.
From Figures 3-7, it can be seen that the parameters ν, ν_1, ν_2, γ, and ε have only a slight effect on the geometric mean accuracy and the number of core sets, which makes core set based LARM all the more attractive in practice. Therefore, the values of ν, ν_1, ν_2, γ, and ε obtained by the cross validation described in Section 4.1 are acceptable for all experiments.

Detection Performance.
For each dataset, samples are randomly split into training patterns and testing patterns with the proportions described in Table 1. The parameters of ν-SVM, SVDD, MAMC, and core set based LARM are selected by fivefold cross validation to make the experimental results convincing.
The geometric mean accuracy is used for performance evaluation. Experiments are repeated 10 times with random data partitions. The average accuracy and the standard deviation are listed in Table 2; NULL indicates that no result was returned within 10 hours. Furthermore, for every dataset, the difference between the bold results and the best geometric mean accuracy is not significant, as determined by the Wilcoxon rank-sum test at a confidence level of 0.05.
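The significance test used here is readily available as scipy.stats.ranksums. A minimal sketch with illustrative accuracy values (the numbers below are invented for demonstration, not taken from Table 2):

```python
import numpy as np
from scipy.stats import ranksums

# ten repeated-run accuracies for two hypothetical methods
acc_a = np.array([0.91, 0.93, 0.90, 0.92, 0.94, 0.91, 0.92, 0.93, 0.90, 0.92])
acc_b = np.array([0.89, 0.90, 0.88, 0.91, 0.90, 0.89, 0.90, 0.88, 0.89, 0.90])

stat, p = ranksums(acc_a, acc_b)
# at confidence level 0.05, reject "same distribution" when p < 0.05
significant = p < 0.05
```

The rank-sum test is a sensible choice here because it makes no normality assumption about the per-run accuracies.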
From Table 2, it can be concluded that the performance of core set based LARM is comparable to the best of ν-SVM, SVDD, and MAMC on all datasets. Core set based LARM performs significantly better than ν-SVM, SVDD, and MAMC on 12, 9, and 13 of the 20 datasets, respectively. This illustrates that, by using the (1 + ε)- and (1 − ε)-approximation algorithm for training LARM, the generalization performance of core set based LARM is comparable to or even better than the best of ν-SVM, SVDD, and MAMC. The average and standard deviation of testing time are shown in Table 4. All the experiments are conducted on a computer with an i5-2400 @ 3.10 GHz CPU and 8 GB SDRAM; NULL indicates that no result was returned within 10 hours. As can be seen from Table 4, the best testing time of ν-SVM, SVDD, and MAMC is slightly better than that of core set based LARM on 11 of the 20 datasets, with a largest gap of 0.002 seconds. However, the testing time of core set based LARM is not the worst one. When the number of testing patterns is 353,349, as for Covtype, the average testing time of core set based LARM is about 1.5 seconds. This shows that core set based LARM can detect testing examples quickly.

Conclusion
In this paper, a novel LARM algorithm and its fast training method based on core sets are proposed for novelty detection on imbalanced data. The proposed LARM algorithm combines the ideas of one-class and binary classification algorithms: it constructs the largest vector-angular region in the feature space to separate the normal training patterns and maximizes the vector-angular margin between this optimal vector-angular region and the abnormal data. To further improve the generalization performance of LARM, the vector-angular distribution is optimized by maximizing the vector-angular mean and minimizing the vector-angular variance. To improve computational efficiency, a (1 + ε)- and (1 − ε)-approximation algorithm is proposed for fast training of LARM based on core sets. The time complexity of core set based LARM is linear in the number of training patterns, and its space complexity is independent of that number. Comprehensive experiments have validated the effectiveness of the proposed approach. In the future, it will be interesting to extend the idea of LARM to other one-class learning problems.

Figure 2: Schematic illustration of core set based LARM.
Figure 3: Influence of parameter ν on geometric mean accuracy and the number of core sets.

Figure 4: Influence of parameter ν_1 on geometric mean accuracy and the number of core sets.

Figure 5: Influence of parameter ν_2 on geometric mean accuracy and the number of core sets.
Figure 6: Influence of parameter γ on geometric mean accuracy and the number of core sets.

Figure 7: Influence of parameter ε on geometric mean accuracy and the number of core sets.

Time Cost.
The time cost of ν-SVM, SVDD, MAMC, and core set based LARM on the different datasets is shown in Tables 3 and 4. The average and standard deviation of training time (including parameter selection and model training time) are shown in Table 3.

In Table 1, d is the data dimension, #pos is the total number of normal patterns, #neg is the total number of abnormal patterns, n_1 is the number of normal training patterns, and n_2 is the number of abnormal training patterns. The dataset sizes range from 178 to more than 495,141, and the proportion of majority to minority data ranges from 10:1 to 1000:1. Experiments are repeated 10 times with random data partitions, and the geometric mean accuracy and the standard deviation are recorded.
4.1. Performance Measurement and Parameter Selection. The performance of core set based LARM is compared with three kernel-based algorithms: ν-SVM, SVDD, and MAMC. The geometric mean accuracy g = (a_+ · a_−)^{1/2} [26] is used for both parameter selection and algorithm evaluation, where a_+ is the classification accuracy on the positive class and a_− is the classification accuracy on the negative class. This measurement is widely applied to imbalanced data.
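The measure is straightforward to compute; a minimal sketch, with labels encoded as ±1 as in this paper:

```python
import numpy as np

def geometric_mean_accuracy(y_true, y_pred):
    """g = sqrt(a_plus * a_minus): geometric mean of per-class accuracies,
    so a majority-class-only predictor scores 0 on imbalanced data."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    a_plus = np.mean(y_pred[y_true == 1] == 1)     # positive-class accuracy
    a_minus = np.mean(y_pred[y_true == -1] == -1)  # negative-class accuracy
    return float(np.sqrt(a_plus * a_minus))
```

For example, predicting every pattern as positive yields a_− = 0 and therefore g = 0, which is exactly why the measure is preferred over plain accuracy on imbalanced data.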

Table 1: Datasets used in experiments.

Table 2: Average geometric mean accuracy and standard deviation on the datasets.

Table 3: Training time on the different datasets.

For every dataset, the difference between the bold results and the best time cost is not significant, as determined by the Wilcoxon rank-sum test at a confidence level of 0.05. From Table 3, it can be clearly seen that the training time of core set based LARM is longer than the best of ν-SVM, SVDD, and MAMC when the number of training patterns is less than 2,143. However, when the number of training patterns is larger than 2,686, as for SDD, MC, Shuttle, Cod-rna, S. segmentation, and Covtype, the training time of core set based LARM is shorter than the best of ν-SVM, SVDD, and MAMC. Even when the number of training patterns increases to 141,792, the average training time of core set based LARM does not exceed 65 seconds. Therefore, the training time of core set based LARM does not increase very quickly with the number of training patterns.

Table 4: Testing time on the different datasets.