A Quick Negative Selection Algorithm for One-Class Classification in Big Data Era

The negative selection algorithm (NSA) is an important one-class classification model, but its low efficiency limits its use in the big data era. In this paper, we propose a new NSA based on Voronoi diagrams: VorNSA. VorNSA changes the detector generation scheme from the traditional “Random-Discard” model to the “Computing-Designated” model. Furthermore, we present an immune detection process of VorNSA under the Map/Reduce framework (VorNSA/MR) to further reduce the time consumption on massive data in the testing stage. Theoretical analyses show that the time complexity of VorNSA decreases from the exponential level to the logarithmic level. Experiments are performed to compare the proposed technique with other NSAs and one-class classifiers. The results show that, on the UCI skin dataset, the time cost of VorNSA is reduced by 87.5% on average compared with traditional NSAs.


Introduction
NSA, proposed by Forrest et al. in 1994 [1], generates immune detectors based on the "Random-Discard" model. Initially, a large number of immature detectors are randomly generated, and then the ones covering the self-areas are discarded. González et al. presented the real-valued negative selection algorithm (RNSA) in 2003 [2], in which the detectors and antigens are studied in real-valued space. Ji and Dasgupta proposed the V-Detector algorithm [3,4]. It turns the fixed-size detectors of RNSA into variable-sized detectors to enlarge the detection areas. In 2015, Cui et al. developed BIORV-NSA [5]. In their work, the self-radius can be variable, and the detectors that are recognized by other mature detectors are replaced by new ones to eliminate the "detection holes."
In the big data era, the low efficiency of NSA becomes an important challenge, which largely limits its applications. In this paper, we design a new NSA based on Voronoi diagrams, named VorNSA. In VorNSA, a restrained Voronoi diagram is first constructed from the whole training set. Then, two types of detectors are generated separately at specific locations of the Voronoi diagram. To accelerate the testing stage of NSA, in particular for large-scale datasets, a new testing strategy, VorNSA/MR (VorNSA with Map/Reduce), is proposed. Unlike the testing stage of classic NSAs, the data are divided into small groups, and the labels of each group are computed separately in the Map stage. The final labels are then obtained after merging and sorting in the Reduce stage.
The contributions of this work can be summarized as follows. (1) Based on Voronoi diagrams, the optimal position of detectors is calculated directly rather than in a stochastic way. Therefore, the time wasted on excessive invalid detectors is avoided. (2) In the Map/Reduce framework, data are partitioned into several small parts by VorNSA/MR and can be processed in parallel to enhance the efficiency of self/non-self discrimination.
The rest of the paper is organized as follows. In Section 2, we describe the definitions of VorNSA. The original contribution of the paper is presented in Section 3. Experimental results on synthetic datasets and real-world datasets are shown and discussed in Section 4. Conclusions appear in Section 5.

Basic Definition of VorNSA
VorNSA is built on the Voronoi diagram, a structure from computational geometry for nearest-neighbor search that has been widely used in the life sciences [6], materials science [7], and mobile navigation [8]. The basic definitions are listed as follows.
Definition 1 (site). The site set is a set of $n$ distinct points in the feature space. In VorNSA, all the training samples are defined as sites: $S = \{s_1, s_2, \ldots, s_n\}$.
Definition 2 (Voronoi diagram). Vor($S$) divides the feature space into $n$ non-overlapping cells based on the given site set $S$. Each cell $V(s_i)$ contains exactly one site $s_i$ of $S$, and any point $q$ in $V(s_i)$ satisfies $\mathrm{dist}(q, s_i) < \mathrm{dist}(q, s_j)$ for all $s_j \in S$, $j \neq i$, where $\mathrm{dist}(\cdot)$ can be any distance metric.
Definition 3 (cell). All the cells form a mathematical partition of the feature space, and the cell corresponding to site $s_i$ is denoted by $V(s_i)$.
Definition 4 (largest empty circle). The largest circle with center $q$ that does not contain any site of $S$ in its interior is denoted by $C_S(q)$.
Theorem 5. A point $q$ is a vertex of Vor($S$) iff $C_S(q)$ contains at least three sites on its boundary [9].
Definition 6 (I-detector). An I-detector is a pair $\langle p, r \rangle$, where $p$ is the detector position in the feature space and $r$ is the detector radius, such that $p$ corresponds to a vertex of the Voronoi diagram.
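Definitions 2 and 4 can be illustrated with a minimal Python sketch. It assumes the Euclidean metric and a brute-force scan over the sites; the function names `cell_of` and `largest_empty_circle` are hypothetical and are not part of VorNSA itself.

```python
import math

def cell_of(q, sites):
    """Index i of the site s_i whose Voronoi cell contains q (Definition 2).

    Ties (points on a cell boundary) are resolved to the lowest index.
    """
    return min(range(len(sites)), key=lambda i: math.dist(q, sites[i]))

def largest_empty_circle(q, sites, tol=1e-9):
    """Radius of the largest empty circle centered at q (Definition 4),
    together with the indices of the sites lying on its boundary."""
    radius = min(math.dist(q, s) for s in sites)
    boundary = [i for i, s in enumerate(sites)
                if abs(math.dist(q, s) - radius) <= tol]
    return radius, boundary

# Four sites at the corners of the unit square (illustrative data only).
sites = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
radius_center, boundary_center = largest_empty_circle((0.5, 0.5), sites)
radius_edge, boundary_edge = largest_empty_circle((0.5, 0.0), sites)
```

In this toy configuration the square's center has all four sites on the boundary of its largest empty circle, so by Theorem 5 it is a Voronoi vertex, whereas the midpoint of the bottom side has only two.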

Theorem 7. Given that $p$ is the center of an I-detector, at least three sites are located on the boundary of $C_S(p)$, and these sites are the nearest sites to $p$ and Voronoi neighbors of each other.
Proof. According to Definitions 2 and 6, the center $p$ of the I-detector is a point where three or more cells meet. Suppose that $p$ is shared by three cells $V(s_i)$, $V(s_j)$, $V(s_k)$, whose sites are $s_i$, $s_j$, $s_k$. According to Definition 4 and Theorem 5, there is a largest empty circle $C_S(p)$ that does not contain any site of $S$, and $s_i$, $s_j$, $s_k$ are located on its boundary. So $s_i$, $s_j$, and $s_k$ are the nearest sites to $p$ in the site set $S$.
Theorem 8. The bisector between sites $s_i$ and $s_j$ defines an edge of Vor($S$) iff there is a point $q$ on the bisector such that $C_S(q)$ contains both $s_i$ and $s_j$ on its boundary but no other site [9].
Definition 9 (II-detector). An II-detector is a pair $\langle p, r \rangle$, where $p$ is the detector position in the feature space and $r$ is the detector radius, such that $p$ corresponds to the intersection of an edge of Vor($S$) with the boundary of the unit hypercube $[0, 1]^d$.
Theorem 10. Given that $p$ is the center of an II-detector, two sites are located on the boundary of $C_S(p)$, and these sites are the nearest sites to $p$ and Voronoi neighbors of each other.
Proof. According to Definitions 2 and 9, the center $p$ of the II-detector lies on the common boundary of two cells. Suppose that $p$ is shared by two cells $V(s_i)$, $V(s_j)$, whose sites are $s_i$, $s_j$. According to Definition 4 and Theorem 8, there is a largest empty circle $C_S(p)$ that does not contain any site of $S$, and $s_i$, $s_j$ are located on its boundary. So $s_i$ and $s_j$ are the nearest sites to $p$ in the site set $S$.
As an example, in Figure 1 there are 10 sites $s_1$ to $s_{10}$ in the set $S$, and the space is divided into 10 cells $V(s_1)$ to $V(s_{10})$ by the Voronoi diagram Vor($S$). The green circle is the largest empty circle of an I-detector center, with three sites ($s_2$, $s_8$, $s_{10}$) located on its boundary. The red and purple circles are the largest empty circles of two II-detector centers, with two sites each on their boundaries: ($s_8$, $s_{10}$) for the red circle and ($s_2$, $s_8$) for the purple circle.
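Theorem 5 suggests a simple, if inefficient, way to locate candidate I-detector centers in two dimensions: the circumcenter of three sites is a Voronoi vertex exactly when its circumcircle contains no other site. The sketch below is a brute-force illustration under that assumption, not the construction used by VorNSA (which relies on a computational geometry toolbox); the names `circumcenter` and `voronoi_vertices` are hypothetical.

```python
import math
from itertools import combinations

def circumcenter(a, b, c):
    """Center of the circle through three 2-D points, or None if collinear."""
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return (ux, uy)

def voronoi_vertices(sites, tol=1e-9):
    """Brute-force Voronoi vertices: circumcenters whose circle is empty."""
    vertices = []
    for a, b, c in combinations(sites, 3):
        q = circumcenter(a, b, c)
        if q is None:
            continue
        radius = math.dist(q, a)
        # Empty-circle test from Theorem 5: no site strictly inside.
        if all(math.dist(q, s) >= radius - tol for s in sites):
            vertices.append((q, radius))
    return vertices

sites = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
verts = voronoi_vertices(sites)
```

This scan costs on the order of $n^4$ distance evaluations and is meant only to make the empty-circle criterion concrete on a handful of points.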

The Details of VorNSA
Algorithm 1: VorNSA (listing not recovered; the surviving steps compute the detector radius using Eq. (1) and Eq. (2) and compare it with the threshold).

I-Detector Generation
The center of an I-detector is a vertex shared by Vet($s_i$), Vet($s_j$), and Vet($s_k$), where Vet($s_i$), Vet($s_j$), and Vet($s_k$) are the vertex sets of three cells. A mature detector is then generated simply by self-tolerating with $s_i$. According to the principle of self-tolerance, the radius of an I-detector can be calculated with

$$r_d = \mathrm{dist}(\mathrm{Vet}(s_i), s_i) - r_s, \qquad (1)$$

where $r_d$ is the radius of the I-detector, Vet($s_i$) is the center of the I-detector, $s_i$ is the nearest site, and $r_s$ is the radius of self-antigens. Furthermore, a threshold on the detector radius is introduced to avoid overfitting: if the detector radius $r_d$ is less than the threshold, the detector is discarded; otherwise, it matures.

II-Detector Generation
The center of an II-detector is a vertex shared by Vet($s_i$) and Vet($s_j$), where Vet($s_i$) and Vet($s_j$) are the vertex sets of two cells. Similarly, the radius of an II-detector can be computed by (2), and the same threshold on the detector radius is applied to avoid overfitting:

$$r_d = \mathrm{dist}(\mathrm{Vet}(s_i), s_i) - r_s, \qquad (2)$$

where $r_d$ is the radius of the II-detector, Vet($s_i$) is the position of the II-detector, $s_i$ is the nearest site, and $r_s$ is the radius of self-antigens. Details of VorNSA can be found in Algorithm 1.
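The self-tolerance step described above can be sketched in a few lines of Python. The sketch assumes that the candidate centers (Voronoi vertices for I-detectors, edge and hypercube intersections for II-detectors) have already been computed, that the distance is Euclidean, and that the radius is the distance to the nearest site minus the self-radius; `mature_detectors` and its parameter names are hypothetical.

```python
import math

def mature_detectors(centers, sites, self_radius, min_radius):
    """Turn candidate centers into mature detectors (center, radius).

    The radius is the distance to the nearest site minus the self-radius;
    candidates whose radius falls below `min_radius` are discarded.
    """
    detectors = []
    for center in centers:
        nearest = min(math.dist(center, s) for s in sites)
        radius = nearest - self_radius
        if radius >= min_radius:
            detectors.append((center, radius))
    return detectors

# Illustrative data; 0.04 and 0.005 are the self-radius and minimum
# detector radius used in the synthetic-data experiments.
sites = [(0.2, 0.2), (0.8, 0.2), (0.5, 0.8)]
centers = [(0.5, 0.2), (0.22, 0.2)]
detectors = mature_detectors(centers, sites, self_radius=0.04, min_radius=0.005)
```

The second candidate lies inside the self-region of the first site, so its radius is negative and it is dropped; no random generation or repeated matching against all self-antigens is involved.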

The Immune Detection Process of VorNSA under Map/Reduce Framework.
In the testing stage of traditional NSAs, each piece of data has to be compared with all the detectors to determine its class. This strategy is too time-consuming to be applied in the big data era. To enhance the efficiency of the testing stage, an immune detection process of VorNSA under the Map/Reduce framework (VorNSA/MR) is proposed. Map/Reduce is a parallel computation framework that splits the sample set into a group of small datasets and handles them on many cluster nodes simultaneously. VorNSA/MR (Figure 2) consists of two parts: the Map stage and the Reduce stage. First, the testing dataset is split into several parts by VorNSA/MR. In the Map stage, each cluster node takes one part of the split data and computes the distances between its samples and the mature detectors. If any distance is less than the detection radius, the testing sample is labeled as a non-self-antigen; otherwise, it is labeled as a self-antigen. The cluster nodes then emit their results as intermediate values. The Reducer receives the intermediate values, sorts them, and merges them into the final results.
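The Map and Reduce stages described above can be mimicked on a single machine with plain Python functions. This is only a sequential sketch of the data flow, assuming Euclidean distance and detectors given as (center, radius) pairs; it does not use any Map/Reduce runtime, and the names `split`, `map_stage`, and `reduce_stage` are hypothetical.

```python
import math

def split(samples, parts):
    """Split samples into `parts` chunks, keeping each sample's index."""
    indexed = list(enumerate(samples))
    return [indexed[i::parts] for i in range(parts)]

def map_stage(chunk, detectors):
    """Label every (index, sample) pair in one chunk.

    A sample covered by any detector is 'non-self', otherwise 'self'.
    """
    out = []
    for idx, x in chunk:
        covered = any(math.dist(x, c) < r for c, r in detectors)
        out.append((idx, "non-self" if covered else "self"))
    return out

def reduce_stage(intermediate):
    """Merge the per-chunk results and sort them back into input order."""
    merged = [pair for part in intermediate for pair in part]
    merged.sort(key=lambda pair: pair[0])
    return [label for _, label in merged]

detectors = [((0.8, 0.8), 0.2)]
samples = [(0.1, 0.1), (0.85, 0.8), (0.5, 0.5), (0.7, 0.9)]
intermediate = [map_stage(chunk, detectors) for chunk in split(samples, 2)]
labels = reduce_stage(intermediate)
```

On a real cluster each call to `map_stage` would run on a different node; because the chunks are independent, the list comprehension is the only place that would need to change.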
The implementations of the Map and Reduce stages can be found in Algorithms 2 and 3.

Theoretical Analysis
Proof of Theorem 11. Since VorNSA is divided into three stages, we can analyze the time complexity of each stage separately.
The main work in the space partition stage is to build a Voronoi diagram, so we borrow the analysis of Voronoi diagrams to estimate the time complexity. The literature [9–12] proves that a Voronoi diagram with $n$ sites can be computed in $O(n \log n + n^{\lceil d/2 \rceil})$ optimal time in $d$-dimensional space. Therefore, the time complexity of this stage is $O(N_S \log N_S + N_S^{\lceil d/2 \rceil})$, where $N_S$ is the size of the training set and $d$ is the dimension of the training set.
In the second and third stages, the main work is to compute the distance between detectors and sites. Although some detectors are discarded by the threshold, their number is very small compared with the total, so we use the number of detectors $|D|$ instead. According to (1) and (2), the time complexity of these two stages is $O(|D|)$.
Combining the above, the time complexity of VorNSA is $O(N_S \log N_S + N_S^{\lceil d/2 \rceil} + |D|)$. The time complexity of traditional NSAs is shown in Table 1, where $P_m$ is the match probability between detectors and antigens, $P_f$ is the failure rate, $N_S$ is the size of the self-set, $|D|$ is the number of detectors, and $d$ is the data dimension. As shown in Table 1, the time complexity of VorNSA is at the logarithmic level in $N_S$, which is much lower than the exponential level of the traditional NNSA [1], RNSA [2], and V-Detector [4].
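As a worked illustration, the bound can be instantiated for low dimensions, writing $N_S$ for the training-set size, $d$ for the dimension, and $|D|$ for the number of detectors. This is a direct substitution into the stated bound, not an additional result.

```latex
\begin{aligned}
d = 2:&\quad O\!\left(N_S \log N_S + N_S^{\lceil 2/2 \rceil} + |D|\right)
        = O\!\left(N_S \log N_S + |D|\right),\\
d = 3:&\quad O\!\left(N_S \log N_S + N_S^{\lceil 3/2 \rceil} + |D|\right)
        = O\!\left(N_S^{2} + |D|\right).
\end{aligned}
```

For two-dimensional data the $N_S \log N_S$ term dominates the construction cost, while from three dimensions upward the $N_S^{\lceil d/2 \rceil}$ term takes over.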

Experiments and Discussion
In the experiments, we use two performance criteria, DR (Detection Rate) and FAR (False Alarm Rate), which are widely reported in the literature [2,3,13]. They are defined as

$$\mathrm{DR} = \frac{TP}{TP + FN}, \qquad \mathrm{FAR} = \frac{FP}{FP + TN},$$

where TP and FN are the counts of true positives and false negatives of non-self-antigens, respectively, and TN and FP are the counts of true negatives and false positives of self-antigens, respectively.
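The two criteria can be computed from label lists as in the following minimal sketch, which treats non-self as the positive class; the function name `dr_far` and the label strings are illustrative choices rather than anything prescribed by the paper.

```python
def dr_far(y_true, y_pred, positive="non-self"):
    """Return (DR, FAR) with `positive` (non-self) as the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    detection_rate = tp / (tp + fn)      # share of non-self correctly detected
    false_alarm_rate = fp / (fp + tn)    # share of self wrongly flagged
    return detection_rate, false_alarm_rate

y_true = ["non-self", "non-self", "non-self", "non-self", "self", "self"]
y_pred = ["non-self", "non-self", "non-self", "self", "self", "non-self"]
dr, far = dr_far(y_true, y_pred)
```

Here three of the four non-self samples are detected and one of the two self samples raises a false alarm.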

Experiments on Synthetic Dataset (SDS).
To assess the performance of VorNSA on different datasets, 4 SDS proposed by the intelligence security laboratory of the University of Memphis are introduced in this section. Each of the original datasets [3] contains 1000 records. We expand the number of records to 10,000 to better simulate a big data environment. The distributions of the datasets are depicted in Figure 3, in which self-antigens are represented by red dots and non-self-antigens by blue dots. The details of the datasets are listed in Table 2.
Additionally, the experiment parameters are set as follows: the self-radius is 0.04, the number of randomly drawn self-antigens ranges from 50 to 1000, and the minimum radius of detectors is 0.005. Each experiment is repeated 25 times independently. As Figure 4 shows, the trends of the experimental results on the 4 SDS are approximately the same. This indicates that VorNSA is applicable to different datasets. In Figure 4(a), it can be observed that the DR decreases from 95% to 80% as the number of self-antigens increases. Besides, in Figure 4(b), the FAR drops from 60% to zero. This phenomenon can be explained as follows: when fewer self-antigens are used for training, some self-antigens cannot be covered by the scope of self, so these self-antigens are identified as non-self-antigens by VorNSA. Because of this strong detection ability, the DR and FAR are both high. As the number of training samples increases, all self-antigens will be covered. Furthermore, some non-self-antigens are also covered and identified as self-antigens, in particular those located at the edge of the self-set. Therefore, the DR decreases slightly while the FAR drops sharply to zero.
Figure 4(c) shows that the number of detectors generated by VorNSA does not increase remarkably with the growth of the training set but stays within a relatively stable range. This implies that VorNSA can effectively control the growth of the detector set. According to Definition 2, as the number of training samples increases, the space is partitioned into smaller cells. Because we introduce a minimum detector radius, the inefficient tiny detectors are discarded.
In Figure 4(d), it can be noted that the time consumption of VorNSA on different datasets is similar, and time cost rises slowly even with enormous self-antigens.It suggests that the performance of VorNSA is less affected by the distribution of dataset, because the optimal position of detectors is calculated directly rather than in a stochastic way.
To sum up, VorNSA can generate fewer but more effective detectors. Besides, the fewer self-antigens are used for training, the higher the FAR will be. As the number of self-antigens increases, the FAR decreases significantly. Increasing the training set leads to a rise in time consumption, and the DR decreases slightly. Hence, a smaller self-set is a smart choice for VorNSA.
To compare different methods, we introduce a classic statistical algorithm for one-class classification: OC-SVM [14], which is implemented with LibSVM [15]. All algorithms run on a computer with an Intel Pentium E6600 at 3.06 GHz, and the implementation of VorNSA uses an open-source computational geometry toolbox called MPT 3.0 [16].
The Skin Segmentation dataset is a UCI dataset.It is collected by randomly sampling B, G, and R values of skin texture, which derives from FERET database and PAL database.Total sample size is 245,057 in which 50,859 records are the skin samples and 194,198 records are non-skin ones.
In this experiment, 50 skin samples are randomly selected as self-antigens. Meanwhile, to verify the performance of VorNSA and VorNSA/MR on a large-scale dataset, we use all 245,057 records in the dataset for testing. The experiments are performed 20 times independently, and the evaluation criteria include DR, FAR, detector number (DN), data training time (DT), and data testing time (DTT). The simulation parameters are set as follows: OC-SVM uses the RBF kernel function, with nu set to 0.5 and gamma set to 0.33. The self-radius of RNSA, V-Detector, and VorNSA is set to the same value (0.1). The maximum number of detectors in RNSA is 3000, and the detector radius is 0.1. The estimated coverage and the maximum self-coverage are 99%. The maximum number of detectors in BIORV-NSA is 1000, the self-set edge inhibition parameter is 0.8, and the detector self-inhibition parameter is 1.2. The minimum radius of detectors is 0.005 in VorNSA and VorNSA/MR. The results of the experiments are shown in Table 3.
From Table 3, it can be seen that the FAR of OC-SVM is 51.2%, which is unacceptably high. As OC-SVM is implemented on a different platform, its time consumption is not reported in this paper. The DR of VorNSA (99.2%) is close to that of BIORV-NSA (99.42%) and better than that of the classic NSAs. Besides, the FAR of VorNSA (1.48%) is lower than that of BIORV-NSA (3.29%). This indicates that the detectors generated by VorNSA are more applicable than those of BIORV-NSA and more effective than those of the classic NSAs. Moreover, the DN, DT, and DTT of VorNSA are significantly lower than those of the other NSAs, especially when it is integrated with the Map/Reduce testing framework. For example, the average number of detectors generated by VorNSA is 172.25, which is 63.3% lower than V-Detector and 82.8% lower than BIORV-NSA. The average training time of VorNSA is 1.91, which is 78% lower than RNSA, 94.1% lower than V-Detector, and 90.5% lower than BIORV-NSA. So the training time of VorNSA is reduced by 87.5% on average compared with traditional NSAs. The testing time of VorNSA/MR is 426.7, which is 36.4% lower than VorNSA, 55% lower than V-Detector, 77.8% lower than BIORV-NSA, and 94.3% lower than RNSA.
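The 87.5% figure is the arithmetic mean of the three training-time reductions quoted in the text, which can be checked with a short Python calculation (the dictionary name is an illustrative choice):

```python
# Training-time reductions of VorNSA relative to each traditional NSA (%),
# as quoted in the discussion of Table 3.
reductions = {"RNSA": 78.0, "V-Detector": 94.1, "BIORV-NSA": 90.5}

mean_reduction = sum(reductions.values()) / len(reductions)
rounded = round(mean_reduction, 1)
```

The unrounded mean is about 87.53%, which rounds to the reported 87.5%.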
The main reasons for the above results can be explained as follows. In traditional NSAs, a large number of immature detectors are randomly generated without any optimization and must self-tolerate with all self-antigens to decide whether they mature. As a result, much time is wasted. The detector generation scheme of VorNSA is quite different from that of other NSAs: the optimal position of detectors is calculated directly. Thus, the time spent on discarding many randomly generated but inappropriate detectors is avoided.

Conclusions
In this paper, we propose a new one-class classification algorithm based on Voronoi diagrams (VorNSA) and an immune detection process of VorNSA under the Map/Reduce framework (VorNSA/MR) to cope with the challenge of big data. VorNSA changes the detector generation mechanism from the "Random-Discard" model to the "Computing-Designated" model. VorNSA/MR divides the sample set into several small parts that can be processed in parallel. Theoretical analyses show that the time complexity of VorNSA decreases from the exponential level to the logarithmic level. Experimental results show that the time consumption of VorNSA is significantly reduced.

Figure 1: The red crosses are the sites. The blue lines form the Voronoi diagram. The circles are the largest empty circles.

Experiments on Skin Segmentation Dataset.
In this section, VorNSA is tested in a group of comparison experiments. The compared algorithms include the classic NSAs (RNSA, V-Detector) and BIORV-NSA, an NSA proposed in 2015.
Figure 4: Results with different training samples. Panels: (a) DR, (b) FAR, (c) number of detectors, (d) detector training time, each plotted against the number of self-antigens ($N_S$).

Theorem 11. The time complexity of VorNSA is $O(N_S \log N_S + N_S^{\lceil d/2 \rceil} + |D|)$, where $N_S$ is the size of the training dataset, $d$ is the dimension of the training dataset, and $|D|$ is the number of detectors.

Table 1: The complexity of NSAs.

Table 2: The details of the 4 SDS.