Index Based Hidden Outlier Detection in Metric Space

1 Guangdong Province Key Laboratory of Popular High Performance Computers, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, Guangdong 518060, China
2 College of Information Engineering, Shenzhen University, Shenzhen, Guangdong 518060, China
3 School of Mathematics and Big Data, Foshan University, Foshan, Guangdong 528000, China
4 Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, School of Biomedical Engineering, Shenzhen University, Shenzhen 518060, China


Introduction
In recent years, the rapid growth of multidimensional data has brought an increasing demand for knowledge discovery, in which outlier detection is an important task with wide applications, such as credit card fraud detection [1], online video detection [2], and network intrusion detection [3].
The wide use of outlier detection has attracted many researchers. Over the past few decades, many outlier definitions and detection algorithms have been proposed in the literature [4][5][6]. Among these, statistics-based outlier detection [7,8], which opened a new era for data mining, has been studied intensively. Nevertheless, statistics-based outlier detection depends on the probability distribution of the dataset, which is usually unknown in advance.
To overcome this shortcoming of statistics-based methods, Knorr and Ng et al. proposed a distance-based definition and the corresponding detection method [9,10]. According to their definition, an object O in a dataset T is a DB(p, D) outlier if at least a fraction p of the objects in T lie at a distance greater than D from O [10]. Soon afterwards, a number of other distance-based definitions were presented [4,6]. Building on these definitions and their detection methods, much subsequent research has significantly improved both the accuracy and the speed of outlier detection.
However, by existing definitions, outliers tend to be far from their nearest neighbors. As a result, if a small number of outliers are close to each other but far from other points, which is also known as an outlier cluster, they are unlikely to be identified as outliers because of their adjacency to each other. In this case, these points are hidden by each other. In other words, outliers themselves have a bad influence on detection accuracy [11]. In this paper, we propose a new perspective, the hidden outlier, which can uncover outliers that are hidden under traditional definitions, and we also design a new detection algorithm. Specifically, we make the following contributions: (1) A more accurate definition of outlier taking hidden outliers into consideration.

Related Work
As mentioned in the introduction, there are many definitions and detection algorithms for outliers, among which Hawkins's work [12] stands out as the first. Following this work, three major distance-based definitions of outliers have been proposed, namely, the k-R outlier [9,10], the k-distance-based outlier [13], and the distance sum-based outlier [14]. A k-R outlier [9,10] in a dataset is a point that has no more than k points within distance R of it, denoted as $O^{(k,R)}_{thres}$ [15]. Please note that the definition of the k-R outlier is essentially equivalent to the DB(p, D) definition [9,10], by which a point in a dataset T is an outlier if at least a fraction p of all points in T lies beyond distance D from it.
A k-distance-based outlier [13] in a dataset is a point whose distance to its kth nearest neighbor is the largest among all points. Usually, the distance of a point to its kth nearest neighbor is considered as its outlier degree, and the n points with the largest outlier degrees are determined as outliers, denoted as $O^{(k,n)}_{kmax}$ [15]. The definition of the distance sum-based outlier [14] considers the sum of distances from a point to its k nearest neighbors as its outlier degree, and the n points with the largest sums are determined as outliers, which can be denoted as $O^{(k,n)}_{ksum}$. Similarly, the distance average-based outlier, an equivalent definition in which the average distance to the k nearest neighbors is considered, can be defined and denoted as $O^{(k,n)}_{kavg}$ [15]. Because the kNN method is easy to implement [16], the above three distance-based definitions have become popular.
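The three kNN-based outlier degrees described above can be illustrated with a minimal brute-force sketch. It assumes Euclidean distance and small in-memory point lists; the function names (`knn_distances`, `outlier_degree`, `top_n_outliers`) are hypothetical and chosen only for this illustration, not taken from any of the cited algorithms.

```python
import math

def knn_distances(points, idx, k):
    """Sorted distances from points[idx] to its k nearest neighbors.

    The point itself is excluded, since a point is not its own neighbor.
    """
    p = points[idx]
    dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != idx)
    return dists[:k]

def outlier_degree(points, idx, k, kind="ksum"):
    """Outlier degree under the kmax, ksum, or kavg definition."""
    d = knn_distances(points, idx, k)
    if kind == "kmax":   # distance to the kth nearest neighbor
        return d[-1]
    if kind == "ksum":   # sum of distances to the k nearest neighbors
        return sum(d)
    if kind == "kavg":   # average distance to the k nearest neighbors
        return sum(d) / len(d)
    raise ValueError("kind must be 'kmax', 'ksum', or 'kavg'")

def top_n_outliers(points, k, n, kind="ksum"):
    """Indices of the n points with the largest outlier degree."""
    order = sorted(range(len(points)),
                   key=lambda i: outlier_degree(points, i, k, kind),
                   reverse=True)
    return order[:n]
```

This brute-force version computes every pairwise distance, which is exactly the cost that pruning and index based methods try to avoid.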
In addition, density-based outliers are in essence also distance-based outliers; for example, LOF [17], the most famous density-based definition, is good at detecting outliers in a dataset with regions of different densities, a task for which ordinary distance-based definitions are not well suited.
Distance-based outlier detection algorithms are usually designed for one or several definitions. Since the above definitions emerged, many detection algorithms have been proposed, of which the most famous is ORCA [18]. ORCA has been regarded as the state-of-the-art algorithm [18] because of its simplicity and efficiency. Moreover, it supports any outlier definition based on a monotonically decreasing function of the nearest neighbor distances, such as $O^{(k,n)}_{kmax}$ or $O^{(k,n)}_{kavg}$. Many variants have since appeared in the literature, including the solving set algorithm [19], MIRO [20], RCS [21], and iORCA [22]. It should be noted that not all variants support the same definitions as ORCA [18]; for instance, iORCA [22] does not support the $O^{(k,n)}_{kavg}$ definition. To speed up kNN search [16], some researchers have turned to other techniques, for example, index based algorithms. Knorr et al. first put forward the index based method; according to their description, the R-tree and the k-d tree can be applied [10]. A k-d tree based method was subsequently proposed with a time complexity of () [23]. Other detection algorithms, including the HilOut based algorithm [24], the P tree based algorithm [25], LSH based algorithms [26,27], and the SFC index [28], use different index structures to speed up detection. Besides, sampling has been applied to reduce distance computations [29].
The k-R outlier has also attracted many researchers since Knorr and Ng et al. proposed the DB(p, D) outlier definition and three relevant algorithms (nested-loop, index based, and cell-based) [9,10]. For example, Tao et al.'s SNIF algorithm [30] can detect all outliers by scanning the dataset at most twice, and even once in some cases. The DOLPHIN algorithm [31] is also an efficient method for detecting k-R outliers. However, the k-R outlier has the shortcoming that its parameters are difficult to set [13].
Since this paper focuses on the outlier definition, we took only the three most popular and simplest algorithms for comparison: ORCA [18], the state-of-the-art distance-based outlier detection algorithm; LOF [17], the most representative density-based outlier detection method; and iORCA [22], the latest index based outlier detection algorithm in metric space. For ORCA and iORCA, $O^{(k,n)}_{ksum}$ and $O^{(k,n)}_{kmax}$ were used as the outlier definition, respectively.

Definition of Hidden Outlier
In this section, we propose the new perspective of the hidden outlier. The basic idea is that once an outlier is detected, it is excluded from the nearest neighbors of other points.

Cause of Hidden Outlier.
As shown in Figure 1, points A and B are outliers, but point C is not. In this case, if we set k = 2 and detect the top 2 outliers by the distance sum-based outlier definition, A and C will be detected as outliers rather than B, which is a real outlier. The reason is that the existence of A reduces the outlier degree of B, leading to the detection mistake.

Definition of Hidden Outlier.
If a small number of points (e.g., two or more) are close to each other but far from other points, they are usually deemed outliers. However, traditional distance-based definitions determine an outlier by its distances to its nearest neighbors.
As a result, such a small group of mutually close points is usually not detected as outliers under traditional definitions. In this situation, the outliers seem to be hidden by each other, and they can be called hidden outliers.
Our definition of outliers that includes hidden ones, denoted as $HO^{(k,n)}_{ksum}$, is extended from the definition of the distance-based outlier. Here we take the distance sum-based outlier, $O^{(k,n)}_{ksum}$, as an example to give a detailed description. The Notation section summarizes the notation and its description.
Let D be a dataset, let dist be a distance function, let p be a point, and let $nn_i(p, D)$ be the ith nearest neighbor of p in D. Please note that a point is not considered a nearest neighbor of itself. Further, let $o_{D,i}$ be the ith outlier by the $O^{(k,n)}_{ksum}$ definition, let $w_k(p, D)$ be the outlier degree of p, and let n-outlier be the set of the n outliers with the largest outlier degrees. According to the definition of $O^{(k,n)}_{ksum}$, $w_k(p, D)$ is given by

$$w_k(p, D) = \sum_{i=1}^{k} dist(p, nn_i(p, D)).$$

Let $ho_{D,i}$ be the ith outlier by our new definition, let $hw_k(p, D)$ be p's outlier degree, and let hn-outlier be the set of the first n outliers.

Definition 1. The definition of outliers that includes hidden ones, $HO^{(k,n)}_{ksum}$, defines the outliers in an iterative way:

$$ho_{D,i} = \arg\max_{p \in D \setminus \{ho_{D,1}, \ldots, ho_{D,i-1}\}} hw_k(p, D), \quad 1 \le i \le n.$$

Definition 2. The outlier degree of p by the $HO^{(k,n)}_{ksum}$ definition, $hw_k(p, D)$, is defined as

$$hw_k(p, D) = w_k(p, D \setminus \{ho_{D,1}, \ldots, ho_{D,i-1}\}),$$

where $ho_{D,1}, \ldots, ho_{D,i-1}$ are the outliers already detected. In other words, $hw_k(p, D)$ is still the sum of distances from p to its k nearest neighbors, while outliers already detected are not considered as p's nearest neighbors. Please note that the sequence of $hw_k(ho_{D,i}, D)$, $1 \le i \le n$, might not be in descending order.
The other definitions of outliers may be deduced by analogy. For example, for $O^{(k,n)}_{kmax}$, we can simply replace the outlier degree of $O^{(k,n)}_{ksum}$ with that of $O^{(k,n)}_{kmax}$, which is

$$w_k(p, D) = dist(p, nn_k(p, D)).$$

Then we can give the same definition of the hidden outlier and its degree as in Definitions 1 and 2. It is worth mentioning that, for simplicity, the $O^{(k,n)}_{ksum}$ based hidden outlier will be called the ksum hidden outlier, while the $O^{(k,n)}_{kmax}$ based hidden outlier will be called the kmax hidden outlier. Though both HOD (Section 4.1) and iHOD (Section 4.2) can apply these two definitions, for lack of space each uses only one definition in our experiments.
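The iterative definition can be illustrated with a naive sketch that follows it literally: pick the point with the largest ksum degree, exclude it from everyone's neighbor lists, and repeat. This is not the HOD algorithm, only a direct reading of the definition; the function names and the toy coordinates are assumptions made for this example and do not reproduce Figure 1.

```python
import math

def degree(points, idx, k, excluded=frozenset()):
    """ksum degree of points[idx], ignoring itself and excluded indices."""
    p = points[idx]
    dists = sorted(math.dist(p, q) for j, q in enumerate(points)
                   if j != idx and j not in excluded)
    return sum(dists[:k])

def traditional_top_n(points, k, n):
    """Top-n outliers by the plain ksum degree."""
    order = sorted(range(len(points)),
                   key=lambda i: degree(points, i, k), reverse=True)
    return order[:n]

def hidden_top_n(points, k, n):
    """Top-n outliers when detected outliers stop counting as neighbors."""
    found = []
    for _ in range(n):
        excluded = frozenset(found)
        rest = [i for i in range(len(points)) if i not in excluded]
        found.append(max(rest, key=lambda i: degree(points, i, k, excluded)))
    return found

# Hypothetical layout: a small cluster (indices 0-3), a border point C
# (index 4), and two mutually close outliers B (index 5) and A (index 6).
pts = [(0, 0), (2, 0), (0, 2), (2, 2), (5, 1), (2, 5.5), (2, 8)]
print(traditional_top_n(pts, 2, 2))  # A and C: B is hidden by A
print(hidden_top_n(pts, 2, 2))       # A and B
```

With k = 2, B's nearest neighbor is A, which keeps B's degree below C's; once A is excluded, B's degree rises above C's. Each round rescans the whole dataset, which is the repeated work a candidate set is meant to avoid.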

Hidden Outlier Detection Algorithm
In this section, the detection algorithms designed for hidden outlier are proposed, together with several theorems and definitions.

Hidden Outlier Detection Algorithm.
The detection algorithm for $HO^{(k,n)}_{ksum}$ is presented in this section. We start with a few theorems and concepts that can be used to speed up the detection. Theorem 3. The hidden outlier degree is always no less than the outlier degree; that is, $hw_k(p, D) \ge w_k(p, D)$.
The proof is straightforward from the definitions of both outlier degrees and is thus omitted.
For convenience, we introduce two more definitions, the maximum possible outlier degree and the hidden outlier candidate set, and related theorems to show their properties. Definition 5. The maximum possible outlier degree of p, $mpw_{k,n}(p, D)$, is the hidden outlier degree of p assuming that all of the first n − 1 nearest neighbors of p are detected as outliers and thus are excluded from contributing to p's hidden outlier degree. That is,

$$mpw_{k,n}(p, D) = \sum_{i=n}^{n+k-1} dist(p, nn_i(p, D)).$$

From Definition 5, we give the following obvious theorem without proof.
The following theorem shows the purpose of defining the hidden outlier candidate set. Theorem 8. The hidden outlier candidate set, $C_{k,n}(D)$, contains all of the outliers defined by the hidden outlier definition.
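The maximum possible outlier degree of Definition 5 and the resulting candidate set can be sketched as follows for the ksum definition. The sketch is brute force and assumes Euclidean distance; it takes the cutoff to be the nth largest traditional degree and keeps ties (`>=`) as a conservative assumption. All function names are hypothetical.

```python
import math

def sorted_dists(points, idx):
    """All distances from points[idx] to the other points, ascending."""
    p = points[idx]
    return sorted(math.dist(p, q) for j, q in enumerate(points) if j != idx)

def ksum_degree(points, idx, k):
    """Sum of distances to the k nearest neighbors."""
    return sum(sorted_dists(points, idx)[:k])

def max_possible_degree(points, idx, k, n):
    """ksum degree if the first n - 1 nearest neighbors were all removed,
    i.e. the sum over the nth to the (n + k - 1)th nearest neighbor."""
    return sum(sorted_dists(points, idx)[n - 1:n - 1 + k])

def candidate_set(points, k, n):
    """Indices whose maximum possible degree reaches the cutoff."""
    degrees = sorted((ksum_degree(points, i, k) for i in range(len(points))),
                     reverse=True)
    cutoff = degrees[n - 1]  # degree of the current nth outlier
    return [i for i in range(len(points))
            if max_possible_degree(points, i, k, n) >= cutoff]
```

Because at most n − 1 neighbors of any point can be excluded before the nth outlier is chosen, a point outside this set cannot reach the cutoff under the iterative definition.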
Our hidden outlier detection (HOD) algorithm (Algorithm 1) is based on the ORCA algorithm [18]. HOD searches the k + n − 1 nearest neighbors (for performance reasons, n must be small) of the objects in every data block (lines (4)-(7)). Once an object can no longer become an outlier, it is removed from the data block (line (8)). After processing a data block, HOD obtains the current top-n outlier set (line (10)), updates the cutoff value c (line (11)), and maintains the hidden outlier candidate set at the same time (lines (12)-(16)): it deletes the objects whose maximum possible outlier degree (MPW) is smaller than c (lines (13)-(14)) and adds the objects whose MPW is larger than c (lines (15)-(16)). Finally, the candidate set is filtered to get the final results (Algorithm 2).
Algorithm 2 shows how the candidate set is filtered to get the final results. The first outlier, $o_{D,1}$, is first moved from the candidate set $C_{k,n}(D)$ to the hidden outlier set (line (1)), because the first hidden outlier and the first traditional outlier are the same. Then the outlier most recently added to the hidden outlier set is checked to see whether it is a nearest neighbor of some object in $C_{k,n}(D)$. If so, it is deleted from that object's nearest neighbors, and the object's hidden outlier degree $hw_k(p, D)$ is updated (lines (2)-(4)). Next, the object with the largest hidden outlier degree in $C_{k,n}(D)$ is moved to the hidden outlier set (line (5)), and the process goes back to line (2) until there are n outliers (line (6)).
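The filtering step can be sketched as below. This is a hedged reading of the prose description of Algorithm 2, not the authors' listing: it assumes each candidate carries its k + n − 1 nearest neighbors as `(distance, neighbor_id)` pairs in ascending order and uses the ksum degree. The helper `nearest_neighbors` and the dictionary layout are hypothetical.

```python
import math

def nearest_neighbors(points, idx, m):
    """The m nearest neighbors of points[idx] as (distance, index) pairs."""
    p = points[idx]
    pairs = sorted((math.dist(p, q), j)
                   for j, q in enumerate(points) if j != idx)
    return pairs[:m]

def filter_candidates(neighbors, k, n):
    """Select n hidden outliers from a candidate set.

    neighbors maps candidate id -> ascending (distance, neighbor_id) list
    of length at least k + n - 1, so that k entries always remain.
    """
    remaining = {c: list(nbrs) for c, nbrs in neighbors.items()}

    def hidden_degree(c):
        # Sum over the k nearest neighbors that are not yet detected.
        return sum(d for d, _ in remaining[c][:k])

    hidden_outliers = []
    while remaining and len(hidden_outliers) < n:
        best = max(remaining, key=hidden_degree)
        hidden_outliers.append(best)
        del remaining[best]
        # Drop the new outlier from the neighbor lists of other candidates.
        for c in remaining:
            remaining[c] = [(d, j) for d, j in remaining[c] if j != best]
    return hidden_outliers
```

Only candidates are rescored in each round, so the cost of this step depends on the candidate set size rather than on the dataset size.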
As Bay and Schwabacher indicated, ORCA [18] has $O(N)$ time complexity on average and $O(N^2)$ in the worst case. In contrast to ORCA [18], HOD searches more nearest neighbors and thus costs more time and memory space. Further, HOD prunes fewer normal objects, which results in more distance computations. However, these do not increase HOD's time complexity because, as more data blocks are processed, the cutoff value c becomes larger and more normal objects are pruned. In other words, the pruning mechanism is the same as that of ORCA [18]. After searching a data block's nearest neighbors, HOD maintains an outlier candidate set and updates it after updating c, which can be implemented with a fast priority queue. In addition, experimental results show that the candidate set size stays roughly constant across dataset sizes, which limits the overhead of the filtering process described in Algorithm 2.

Index Based Hidden Outlier Detection Algorithm.
To speed up the HOD algorithm, we draw on the experience of iORCA and develop a simple index based hidden outlier detection algorithm. As discussed in Section 2, iORCA is an excellent distance-based improvement of ORCA, which is the state-of-the-art method in outlier detection. It is worth mentioning that the index building process of iORCA (Algorithm 3) is very simple and costs very little time. Its only weakness is that iORCA lacks a pivot selection method. To address this shortcoming, we propose a simple but efficient pivot selection algorithm (Algorithm 4).
The most important idea of iORCA is to update the cutoff value faster. As Algorithm 3 shows, to achieve this goal iORCA simply calculates the distance from every object to the pivot (lines (2)-(3)) and then sorts the objects by this distance from largest to smallest (line (4)). Obviously, both the time and the memory consumption of this index are very small.
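The index construction described above amounts to one distance computation per object followed by a sort. A minimal sketch, assuming Euclidean distance and a single pivot given as a coordinate tuple (the function name `build_index` is hypothetical):

```python
import math

def build_index(points, pivot):
    """Return (distance_to_pivot, point_index) pairs, farthest first.

    Objects far from the pivot are visited first, so likely outliers are
    processed early and the cutoff value can rise quickly.
    """
    index = [(math.dist(p, pivot), i) for i, p in enumerate(points)]
    index.sort(reverse=True)
    return index
```

The index stores one number per object, which is why its time and memory overhead is small compared with tree-based indexes.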
The largest difference between similarity search and outlier detection is that a similarity search procedure may be run many times, for different query objects, while an outlier detection procedure may be run only a few times, or even just once. This means that their pivot selection methods for building the index should differ. Since similarity search always follows the rule of offline construction and online search [35,36], it can spend more time selecting pivots and building the index, whereas outlier detection must save time at this stage.
As Bhaduri et al. pointed out, if we select inlier points as pivots, the points farthest from them are more likely to be outliers [22]. To benefit from this, the goal of our pivot selection algorithm is to select points in dense regions as pivots. However, an exact search for dense points is too expensive. So, in our pivot selection algorithm, we only find the approximate dense regions (lines (4)-(5)) and then choose the middle point of these regions (line (6)).
In addition, we use only a part of the dataset to select pivots; usually one data block is enough. Because outliers make up a small percentage of the data, it is almost impossible for our method to select an outlier as a pivot from this subset. The flowchart of the iHOD algorithm is shown in Figure 2.
The stopping rule (Theorem 9) of iORCA can be applied in iHOD as well to quickly stop updating the cutoff value.
Theorem 10 can be proved similarly to Theorem 9.

The four test datasets are picked from the UCI Machine Learning Repository [38], and then a few objects from a class of small cardinality together with other larger classes are finally picked as the test suite. The first dataset is from the KDD Cup 1999 dataset, used by Yang [37]. The first 20,000 TCP records of its ten percent version, which include 76 abnormal connections or outliers, are picked. The second dataset is from the Optdigits (short for Optical Recognition of Handwritten Digits) dataset. The first 10 records from class 0, serving as outliers, together with 3,447 other records are picked. The third one is from the Breast Cancer Wisconsin (Diagnostic) dataset. The first 10 records of the malignant ones, serving as outliers, are picked, making 367 the total number of records. The last one is from the Heart Disease dataset. All records in class 4, serving as outliers, together with the other records except those in classes 1-3 are picked. Altogether, there are 203 records with 15 outliers in dataset 4.
After this simple construction, the four datasets match Hawkins's outlier definition [12], which is generally accepted.

Evaluation Methodology.
In statistics, a receiver operating characteristic (ROC) curve is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The area under the curve (AUC) is then calculated; the larger the AUC, the better the detector.
With regard to outlier detection, as mentioned before, let D be the dataset, let |D| be the dataset size, and let n be the threshold parameter. In other words, when n is set to a specific value, the top-n points are identified as outliers, and the other |D| − n points in the dataset are regarded as inliers. So, for a specific n, we get one pair of TPR and FPR values. By varying n from 0 to |D|, we obtain the ROC curve and can then calculate the AUC value.
Obviously, the above calculation assumes a specific neighbor number k. If we vary k, we get the AUC-k plot, which makes it easy to see how the outlier detection accuracy depends on the parameter k.
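The evaluation procedure can be sketched as follows: given a ranking of all objects by decreasing outlier degree and the set of true outliers, sweep n from 0 to |D| and integrate the resulting ROC curve with the trapezoidal rule. The function names are hypothetical, and the sketch assumes that the dataset contains at least one outlier and at least one inlier.

```python
def roc_points(ranking, true_outliers):
    """(FPR, TPR) pairs for n = 0 .. |D|, where the top-n are flagged."""
    positives = len(true_outliers)
    negatives = len(ranking) - positives
    tp = fp = 0
    points = [(0.0, 0.0)]          # n = 0: nothing is flagged
    for obj in ranking:            # each step increases n by one
        if obj in true_outliers:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

For example, with seven objects of which two are outliers, a ranking that places one inlier between the two outliers yields an AUC of 0.9, while a ranking that places both outliers first yields 1.0. Repeating this for several values of k gives the points of an AUC-k plot.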

Experimental Results
The experiments were run on a desktop with an Intel Core 3.40 GHz processor and 8 GB RAM, running Windows 7 Professional Service Pack 1. The algorithms were implemented in C++ and compiled with Visual Studio 2012 in Release mode. Unless otherwise specified, both the number of nearest neighbors k and the number of outliers to detect n are set to 10. The accuracy is studied first, followed by the running time and finally the performance of the pivot selection algorithm.

Accuracy.
The accuracy of the five algorithms, namely, HOD, ORCA [18], iHOD, iORCA [22], and LOF [17], is first studied by way of AUC. For each fixed value of k, the AUC values of the five algorithms are computed by varying n. Then the AUC values for various values of k are plotted in Figure 3. Please note that ORCA uses the $O^{(k,n)}_{ksum}$ definition and HOD uses its extension, the ksum hidden outlier, while iORCA uses the $O^{(k,n)}_{kmax}$ definition and iHOD uses the kmax hidden outlier. It can be seen that HOD and ORCA achieve higher AUC values than LOF in almost all cases on the four datasets. Further, HOD is better than ORCA in most cases. In the cases where ORCA is better than HOD, the values are close to 1, and the difference is thus marginal. The results of iHOD and iORCA are plotted in the same figure.
To see the difference more clearly, we further compare the five algorithms by way of true positives (Figure 4). iHOD is the best in most cases, while LOF is the worst in almost all cases. In particular, the true positives of LOF are close to 0 for the KDD Cup 1999 and Optdigits datasets. In Figure 4(c), we can see that HOD and iHOD obtain true positives similar to those of ORCA and iORCA, respectively. This is because the Breast Cancer Wisconsin (Diagnostic) dataset is free of outliers hidden by other ones.
To examine the performance of iHOD in more detail, we also ran nonparametric tests [39], including Friedman's test [40] and Hochberg's procedure [41]. Although Hochberg's procedure shows that the difference between iHOD and iORCA is not statistically significant, Figures 3 and 4 show that iHOD improves on iORCA, which is itself an excellent outlier detection algorithm. Table 1, which gives the rankings of average true positives from Figure 4, also shows that iHOD outperforms iORCA.

Efficiency.
As mentioned earlier, HOD and iHOD do extra work in filtering the candidate set, which may cost much time if the candidate set is too large. The candidate set size is plotted with respect to the value of n (Figure 5) and the dataset size (Figure 6). The Breast Cancer Wisconsin (Diagnostic) dataset and the Heart Disease dataset are not included because of their tiny sizes. Clearly, the candidate set size is not strongly correlated with the dataset size, which limits its effect on the running time.
The running times of the seven algorithms on the KDD Cup 1999 dataset and the Optdigits dataset are shown in Figure 7. Please note that HOD has better accuracy than ORCA [18] and LOF [17], while iHOD is better than almost all algorithms, as indicated in the last section. To make the comparison fair, we extend ORCA and iORCA [22] to HORCA and HiORCA, which achieve the same detection results as HOD and iHOD. That is, ORCA or iORCA is run n times; each time only one outlier is detected, and it is removed from the following runs. The running times of HORCA and HiORCA are reported as well.
The figure shows clearly that the running time of LOF increases dramatically as the dataset size increases. HOD, with better accuracy, takes about 1.5 times as long as ORCA [18]. However, HORCA, with accuracy equivalent to that of HOD, takes up to 9 times the running time of HOD. To our surprise, iHOD runs even faster than iORCA [22]. This is because iHOD uses Theorem 10 to greatly reduce the number of distance computations.

Performance of Pivot Selection Algorithm.
To evaluate the performance of the pivot selection algorithm, we compare iORCA and iHOD with and without pivot selection, in terms of both running time and the number of distance computations. For the sake of accuracy, we ran every experimental setup 10 times and then computed the standard deviation and mean value.
From Tables 2 and 3, we can see that our pivot selection algorithm makes the experimental results very stable and even gives better mean values for both running time and the number of distance computations.
Like other popular outlier detection algorithms such as ORCA [18], iORCA [22], and LOF [17], both HOD and iHOD have two parameters: the number of nearest neighbors k and the number of outliers n. It is unwise to decrease these two parameters in order to improve efficiency, because doing so may seriously reduce the detection quality. In fact, k can be set with reference to $O^{(k,n)}_{kavg}$ and so on. The candidate set size grows in proportion to n, which increases the running time. Fortunately, n depends on the user's requirement, so it is usually not large.

Analysis of Mistaken Detection.
As described above, the HOD and iHOD algorithms detect the next outlier after excluding the outliers already detected, so that they can detect more real outliers. Is there any possibility that normal objects are mistakenly detected as outliers?
To simplify the analysis, let D be the dataset, let $O^{(k,n)}_{kmax}$ be the original outlier definition, and let d be the number of dimensions of the dataset. Region $R_1$, with radius r, contains $n_1$ objects, including an outlier $o_1$. Another region, $R_2$, with the same radius, contains $n_2$ objects, including a normal object $o_2$. According to the distance-based outlier definition, $n_1 < n_2$.
Consider the expected outlier degrees of $o_1$ and $o_2$. After m objects are taken out of regions $R_1$ and $R_2$ separately, both expected outlier degrees increase; denote the resulting variations by $\Delta_1$ and $\Delta_2$. Because $\Delta_1 > 0$ and $\Delta_2 > 0$, their values can be compared via their ratio (formula (16)). Since $n_1 < n_2$, it follows that $\Delta_1 > \Delta_2$. In other words, the outlier degree of an outlier varies more than that of a normal object, so the outlier is more easily detected. However, owing to the complexity of dataset distributions and the limitations of the distance-based outlier definition, it remains possible that a very few normal objects are mistakenly detected.
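The direction of this comparison can be checked numerically under a simple stand-in model. The sketch below is not the paper's formula: it assumes that objects are spread uniformly in a d-dimensional ball of radius r and approximates the expected distance to the kth nearest neighbor by the radius of a concentric ball holding a fraction k/n of the volume, that is, $r (k/n)^{1/d}$. The function names are hypothetical.

```python
def expected_kth_nn_distance(n, k, r, d):
    """Volume heuristic (an assumption): a ball containing k of n uniformly
    spread objects occupies a fraction k / n of the region's volume."""
    return r * (k / n) ** (1.0 / d)

def degree_change(n, m, k, r, d):
    """Increase of the expected kmax degree after removing m of n objects.

    Requires k <= n - m so that k neighbors remain in the region.
    """
    return (expected_kth_nn_distance(n - m, k, r, d)
            - expected_kth_nn_distance(n, k, r, d))

# Sparse region (n1 = 20) versus dense region (n2 = 200); m = 5 removed.
delta_1 = degree_change(20, 5, 3, 1.0, 2)
delta_2 = degree_change(200, 5, 3, 1.0, 2)
print(delta_1 > delta_2 > 0)
```

Under this heuristic the change is a decreasing function of the region's population, so the sparser region always shows the larger increase, in line with the conclusion that $\Delta_1 > \Delta_2$ when $n_1 < n_2$.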

Conclusions and Future Work
In this paper, we propose a new distance-based outlier definition to detect hidden outliers and design a corresponding detection algorithm, which excludes detected outliers to reduce their influence and improve accuracy. The candidate set based method also avoids repeatedly scanning the whole dataset. Experimental results show that our HOD algorithm is more accurate than ORCA [18] and LOF [17]. Further, with better accuracy, HOD is much faster than LOF and comparable in speed to ORCA. In addition, to achieve the same accuracy, ORCA (run repeatedly as HORCA) takes up to 9 times the running time of HOD. With the help of the triangle inequality to reduce the number of distance computations, iHOD is much faster than iORCA and HiORCA. Moreover, we develop a pivot selection algorithm to avoid choosing outliers as pivots, and our experiments show that this method makes the results stable.
Our definition and algorithm are extended from two distance-based outlier detection algorithms: ORCA [18] and iORCA [22]. The case of hidden outliers also exists for density-based outlier detection, which we will study in the future. Besides, we will further study how to choose the right pivots to speed up our detection algorithm.

Notation
B: A data block, that is, a set of objects
L: Index; for example, L(D) denotes the index of the dataset D
blockIndex: Index of a data block
partNum: Number of segments/partitions
pivotNum: Number of pivots. In this paper, it can only be set to 1.

Figure 2: The flowchart of the iHOD algorithm.

Figure 3: AUC values with various values of k for five algorithms on four datasets.

Figure 4: True positive values with various values of n for five algorithms on four datasets.

Figure 5: HOD's candidate set size with various values of n.

Figure 6: HOD's candidate set size with various dataset sizes.

Table 1: Rankings obtained through Friedman's test over average true positives of Figure 4.

Table 2: Standard deviation and mean value of running time (milliseconds) over KDD Cup 1999 dataset.

Table 3: Standard deviation and mean value of distance calculation times ($10^4$ times) over KDD Cup 1999 dataset.
$w_k(p, D)$: Outlier degree of p in dataset D
n-outlier: The set of top-n outliers with respect to the traditional outlier definition
$ho_{D,i}$: The ith outlier by the hidden outlier definition
$hw_k(p, D)$: p's hidden outlier degree in dataset D; see details in Definition 2
hn-outlier: The set of top-n hidden outliers
$mpw_{k,n}(p, D)$: Maximum possible outlier degree of p with regard to dataset D; see details in Definition 5
$C_{k,n}(D)$: Hidden outlier candidate set, each object of which has a larger maximum possible outlier degree than the cutoff value c
c: Cutoff value, equal to the outlier degree of the current nth outlier; namely, $c = w_k(o_{D,n}, D)$ or $c = hw_k(ho_{D,n}, D)$, depending on the outlier definition