Useless and noisy information makes up a large share of big data, which makes it harder to extract valuable information. Outlier detection has therefore attracted much attention recently. However, if two points are far from all other points but relatively close to each other, they are less likely to be detected as outliers because of their adjacency to each other. In this situation, the outliers are hidden by each other. In this paper, we propose a new perspective, that of hidden outliers. Experimental results show that it is more accurate than existing distance-based definitions of outliers. Accordingly, we develop a candidate-set-based hidden outlier detection (HOD) algorithm, which achieves higher accuracy with comparable running time. Further, we develop an index-based HOD (iHOD) algorithm to achieve higher detection speed.
In recent years, the rapid growth of multidimensional data has brought an increasing demand for knowledge discovery, in which outlier detection is an important task with wide applications, such as credit card fraud detection [
As a matter of necessity, the wide use of outlier detection has aroused enthusiasm in many researchers. Over the past few decades, many outlier definitions and detection algorithms have been proposed in the literature [
In order to overcome shortcoming of statistics-based method, Knorr and Ng et al. proposed distance-based definition and the corresponding detection method [
However, under existing definitions, outliers tend to be far from their nearest neighbors. As a result, if a small number of outliers are close to each other but far from other points, forming what is also known as an outlier cluster, they are unlikely to be identified as outliers because of their adjacency to each other. In this case, these points are hidden by each other. In other words, outliers themselves have a bad influence on detection accuracy [
A more accurate definition of outlier that takes hidden outliers into consideration.
An effective detection algorithm for the proposed definition, with higher accuracy at the cost of only a little more time.
An index-based HOD (iHOD) algorithm that achieves much higher detection speed.
A simple but effective pivot selection algorithm for iHOD that avoids selecting outliers as pivots.
The rest of this paper is organized as follows. In Section
As mentioned in the introduction, there are many definitions and detection algorithms for outliers, among which Hawkins’s work [
The definition of distance sum-based outlier [
In addition, density-based outlier is actually distance-based outlier; for example, LOF [
Distance-based outlier detection algorithms are usually designed for one or several definitions. Since the above definitions emerged, many detection algorithms have been proposed over the past decades, among which the most famous is ORCA [
For speeding up
Similarly, the research works for density-based detection algorithm started from their incipient definition, which is LOF [
Since this paper focuses on the outlier definition, we take only three of the most popular and simplest algorithms for comparison: ORCA [
In this section, we present the new perspective of hidden outliers. The basic idea is that once an outlier is detected, it is excluded from the nearest neighbors of all other points.
As is shown in Figure
A simple two-dimensional dataset.
If a small number of points (e.g., two or more) are close to each other but far from other points, they are usually deemed outliers. However, traditional distance-based definitions determine an outlier by its distances to its nearest neighbors. As a result, such a small group of mutually close points is usually not identified as outliers under traditional definitions. In this situation, the outliers appear to be hidden by each other, and they can be called hidden outliers.
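The effect described above can be illustrated with a minimal Python sketch. It is not the paper's HOD algorithm: it assumes the k-th nearest-neighbor distance as the outlier degree, uses a brute-force distance matrix, and runs on a hypothetical toy dataset. The function names (`top_n_traditional`, `top_n_hidden`) are illustrative only.

```python
import numpy as np


def pairwise_distances(points):
    """Euclidean distance matrix for an (N, d) array."""
    diff = points[:, None, :] - points[None, :, :]
    return np.linalg.norm(diff, axis=-1)


def kth_nn_degree(dist, active, k):
    """k-th nearest-neighbor distance of each active point, ignoring inactive ones."""
    idx = np.flatnonzero(active)
    sub = dist[np.ix_(idx, idx)].copy()
    np.fill_diagonal(sub, np.inf)           # a point is not its own neighbor
    return idx, np.sort(sub, axis=1)[:, k - 1]


def top_n_traditional(points, k, n):
    """Rank all points once by their k-th nearest-neighbor distance."""
    dist = pairwise_distances(points)
    idx, deg = kth_nn_degree(dist, np.ones(len(points), dtype=bool), k)
    return [int(i) for i in idx[np.argsort(-deg, kind="stable")[:n]]]


def top_n_hidden(points, k, n):
    """Detect one outlier at a time and exclude it from later neighbor lists."""
    dist = pairwise_distances(points)
    active = np.ones(len(points), dtype=bool)
    found = []
    for _ in range(n):
        idx, deg = kth_nn_degree(dist, active, k)
        winner = int(idx[np.argmax(deg)])
        found.append(winner)
        active[winner] = False              # no longer anyone's neighbor
    return found


# Hypothetical toy data: a dense group (indices 0-4) and three far points (5-7).
toy = np.array([[0, 0], [1, 0], [2, 0], [3, 0], [4.2, 0],
                [10, 0], [11, 0], [13, 0]], dtype=float)
print(top_n_traditional(toy, k=2, n=3))
print(top_n_hidden(toy, k=2, n=3))
```

On this toy data with k = 2 and n = 3, the one-shot ranking picks the edge point of the dense group (index 4) instead of the middle far point (index 6), whose two nearest neighbors are the other far points. The iterative variant returns all three far points, because removing the first detected outlier raises the degrees of the points it was hiding.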
Our definition of outliers which include hidden ones, denoted as
Let
Let
The definition of outliers include hidden ones,
The outlier degree of
In other words,
The other definition of outliers may be deduced by analogy. For example,
Then we can give the same definition of hidden outlier and its degree as Definitions
In this section, we propose detection algorithms for hidden outliers, together with several supporting theorems and definitions.
The detection algorithm of
The hidden outlier degree is always no less than the outlier degree or
The proof is straightforward from the definitions of both outlier degrees and is thus omitted.
The
(
(
(a) If
(b) Otherwise, there must exist
For convenience, we introduce two more definitions,
The
The hidden outlier candidate set,
The following theorem shows the purpose of defining the hidden outlier candidate set.
The hidden outlier candidate sets,
For an arbitrary outlier
From Theorem
As a result,
Our hidden outlier detection (HOD) algorithm (Algorithm
(1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12) (13) (14) delete (15) (16) insert (17) (18) return
(1) (2) Set (3) (4) (5) (6) (7) return
Algorithm
As Bay and Schwabacher indicated, ORCA [
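As background for the cutoff-based pruning that ORCA-style detectors rely on, the following Python sketch shows a simplified nested loop with a cutoff value. It is an assumption-laden illustration, not the algorithm listing of this paper: it uses the k-th nearest-neighbor distance as the outlier degree, processes points one at a time rather than in blocks, and assumes the dataset has more than k points. The name `orca_style_top_n` is hypothetical.

```python
import heapq
import numpy as np


def orca_style_top_n(points, k, n, seed=0):
    """Top-n points by k-th NN distance, using a nested loop with cutoff pruning."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(points))    # random scan order helps pruning
    top = []                                # min-heap of (score, index)
    cutoff = 0.0                            # weakest score among the current top-n
    for i in order:
        nearest = []                        # negated max-heap of the k smallest distances
        pruned = False
        for j in order:
            if i == j:
                continue
            d = float(np.linalg.norm(points[i] - points[j]))
            if len(nearest) < k:
                heapq.heappush(nearest, -d)
            elif d < -nearest[0]:
                heapq.heapreplace(nearest, -d)
            # The running k-th distance can only shrink, so it bounds the final score.
            if len(nearest) == k and -nearest[0] < cutoff:
                pruned = True
                break
        if pruned:
            continue
        score = -nearest[0]
        if len(top) < n:
            heapq.heappush(top, (score, int(i)))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, int(i)))
        if len(top) == n:
            cutoff = top[0][0]
    return sorted(top, reverse=True)        # list of (score, index), best first
```

A point is abandoned as soon as its running k-th neighbor distance falls below the cutoff, since it can no longer enter the top-n list. The faster the cutoff grows, the earlier most inliers are pruned.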
In order to speed up HOD algorithm, we draw on the experience of iORCA and develop a simple index based hidden outlier detection algorithm. As we discussed in Section
(1) (2) (3) (4) sort( //descending order (5) return
(1) (2) (3) blockIndex (4) divide blockIndex into and 10, every segment has the same number of points (5) get (6) (7) return
The most important idea of iORCA is to update the cutoff value faster. In Algorithm
The largest difference between similarity search and outlier detection is that a similarity search procedure may be run many times, to search for different objects, whereas an outlier detection procedure may be run only a few times, or even just once. This difference means that their pivot selection methods for building the index will differ. Since similarity search always follows the rules of Offline Construction and Online Search [
As Bhaduri et al. pointed out, if we select inlier points as pivots, the points farthest from them are more likely to be outliers [
In addition, we use only a part of the dataset to select pivots; usually one data block is enough. Because outliers make up only a small percentage of the data, it is almost impossible to select outliers as pivots from this subset using our method. The flowchart of the iHOD algorithm is shown in Figure
The flowchart of the iHOD algorithm.
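The pivot-based ordering discussed above can be sketched in Python as follows. This is only an illustration under stated assumptions: the pivot is taken to be the coordinate-wise median of the data, which stands in for an inlier pivot and is not the pivot selection algorithm of this paper, and the toy dataset and the name `build_pivot_index` are hypothetical.

```python
import numpy as np


def build_pivot_index(points, pivot):
    """Return point indices sorted by decreasing distance to the pivot, with those distances."""
    dist_to_pivot = np.linalg.norm(points - pivot, axis=1)
    order = np.argsort(-dist_to_pivot, kind="stable")   # farthest points first
    return order, dist_to_pivot[order]


# Hypothetical toy data: a dense group (indices 0-4) and three far points (5-7).
toy = np.array([[0, 0], [1, 0], [2, 0], [3, 0], [4.2, 0],
                [10, 0], [11, 0], [13, 0]], dtype=float)
pivot = np.median(toy, axis=0)      # assumed stand-in for an inlier pivot
order, dists = build_pivot_index(toy, pivot)
print(order[:3])
```

With an inlier-like pivot, the three far points are placed at the front of the index, so a detector that scans in this order meets likely outliers first and can raise its cutoff value early.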
The stopping rule (Theorem
Let
In addition, we propose further rules to reduce the number of distance computations.
If
Theorem
If
Using Triangle Inequality, there is
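To illustrate how the triangle inequality can save distance computations in general, the sketch below uses the standard pivot lower bound |d(q, p) − d(y, p)| ≤ d(q, y). It is not a transcription of the theorems above, whose exact statements are not reproduced here; the function name `kth_nn_distance_pruned` and its parameters are hypothetical.

```python
import numpy as np


def kth_nn_distance_pruned(points, q, k, dist_to_pivot):
    """Exact k-th NN distance of points[q]; also returns how many distances were evaluated."""
    # Lower bound on d(q, y) obtained from precomputed distances to one pivot.
    lower = np.abs(dist_to_pivot - dist_to_pivot[q])
    best = []                   # k smallest exact distances so far, ascending
    evaluated = 0
    for y in np.argsort(lower, kind="stable"):
        if y == q:
            continue
        if len(best) == k and lower[y] >= best[-1]:
            break               # all remaining candidates have even larger lower bounds
        d = float(np.linalg.norm(points[q] - points[y]))
        evaluated += 1
        best = sorted(best + [d])[:k]
    return best[-1], evaluated


rng = np.random.default_rng(1)
data = rng.normal(size=(200, 2))
to_pivot = np.linalg.norm(data - data[0], axis=1)    # point 0 used as the pivot
print(kth_nn_distance_pruned(data, q=50, k=5, dist_to_pivot=to_pivot))
```

Candidates are visited in order of increasing lower bound, so the scan can stop as soon as the bound reaches the current k-th distance; every skipped candidate is guaranteed to be at least that far away.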
Our index based hidden outlier detection algorithm (Algorithm
(1) (2) pivot (3) (4) (5) (6) (7) (8) (9) (10) (11) (12) (13) (14) (15) (16) (17) (18) (19) (20) (21) delete (22) (23) insert (24) (25) return
In this section, we compare HOD and iHOD algorithm with ORCA [
We follow a common way [
After this simple construction, the four datasets match Hawkins’s outlier definition [
In statistics, a receiver operating characteristic (ROC) curve is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The area under the curve (AUC) is then computed; the larger the AUC, the better the detector.
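As a small illustration of the AUC computation, the Python sketch below uses the pairwise (Mann-Whitney) formulation, which equals the area under the ROC curve: the fraction of (outlier, inlier) pairs in which the outlier receives the higher score, with ties counted as one half. The function name `auc_from_scores` and the example scores are hypothetical and are not taken from the experiments.

```python
import numpy as np


def auc_from_scores(scores, labels):
    """AUC of outlier scores; labels are 1 for true outliers and 0 for inliers."""
    s = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=bool)
    pos, neg = s[y], s[~y]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("both classes must be present")
    diff = pos[:, None] - neg[None, :]       # every outlier/inlier score pair
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))


# Hypothetical outlier degrees for four objects, two of which are true outliers.
print(auc_from_scores([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```

In the example, three of the four outlier/inlier pairs are ordered correctly, so the AUC is 0.75. A detector that ranks every outlier above every inlier reaches 1.0.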
With regard to outlier detection, as mentioned before, let
Obviously, the above calculation is in the case that neighbor number
The experiments were run on a desktop with Intel Core 3.40 GHz processor, 8 GB RAM, running Windows 7 Professional Service Pack 1. The algorithms were implemented in C++ and compiled with Visual Studio 2012 in Release Mode. Unless otherwise specified, both the parameters
The accuracy of the five algorithms, namely, HOD, ORCA [
AUC values with various values
Panels: KDD Cup 1999; Optdigits; Breast Cancer Wisconsin (Diagnostic); Heart Disease Dataset.
It can be seen that HOD and ORCA achieve higher AUC values than LOF in almost all cases across the four datasets. Further, HOD is better than ORCA in most cases. In the cases where ORCA is better than HOD, both values are close to 1, so the difference is marginal. The results for iHOD and iORCA are shown as well.
To see the difference more clearly, we further compare the three algorithms by way of true positives (Figure
True positive values with various values
Panels: KDD Cup 1999; Optdigits; Breast Cancer Wisconsin (Diagnostic); Heart Disease Dataset.
In order to see more details about the performance of iHOD, we also do some Non-Parametric Tests [
Rankings obtained through Friedman’s test over average true positives of Figure
Algorithm | ORCA | HOD | LOF | iORCA | iHOD |
---|---|---|---|---|---|
Ranking | 3.75 | 3.25 | 5 | 1.875 | 1.125 |
As mentioned earlier, HOD and iHOD perform an extra step, filtering through the candidate set, which may take considerable time if the candidate set is too large. The candidate set size with respect to values of
HOD’s candidate set size with various values of
Panels: KDD Cup 1999; Optdigits.
HOD’s candidate set size with various dataset sizes.
Panels: KDD Cup 1999; Optdigits.
The running time of the seven algorithms for the KDD Cup 1999 dataset and the Optdigits dataset is listed in Figure
Running time (milliseconds) over KDD Cup 1999 and Optdigits datasets with
Panels: KDD Cup 1999; Optdigits.
The figure shows clearly that the running time of LOF increases dramatically as the dataset size increases. HOD, with better accuracy, takes almost 1.5 times as long as ORCA [
To evaluate the pivot selection algorithm, we compare iORCA and iHOD with and without it, in terms of both running time and the number of distance calculations. To obtain reliable estimates, we ran every experimental setup 10 times and report the standard deviation and mean value.
From Tables
Standard deviation and mean value of running time (milliseconds) over KDD Cup 1999 dataset.
Category | 5,000 (std. dev.) | 5,000 (mean) | 10,000 (std. dev.) | 10,000 (mean) | 15,000 (std. dev.) | 15,000 (mean) | 20,000 (std. dev.) | 20,000 (mean) |
---|---|---|---|---|---|---|---|---|
iORCA random | 16.7 | 697.60 | 8.0 | 1,363.50 | 21.7 | 2,040.60 | 78.5 | 2,725.40 |
iORCA psm | 4.9 | 688.10 | 8.1 | 1,379.10 | 34.6 | 2,098.20 | 26.1 | 2,717.60 |
iHOD random | 39.0 | 216.70 | 96.4 | 368.10 | 54.6 | 405.60 | 33.6 | 458.80 |
iHOD psm | 21.2 | 232.20 | 35.9 | 309.00 | 83.2 | 402.60 | 22.2 | 441.60 |
Standard deviation and mean value of the number of distance calculations (×10^4) over KDD Cup 1999 dataset.
Category | 5,000 (std. dev.) | 5,000 (mean) | 10,000 (std. dev.) | 10,000 (mean) | 15,000 (std. dev.) | 15,000 (mean) | 20,000 (std. dev.) | 20,000 (mean) |
---|---|---|---|---|---|---|---|---|
iORCA random | 5.23 | 504.1 | 1.39 | 1,002.5 | 0.201 | 1,502.5 | 0.332 | 2,003.3 |
iORCA psm | 0.00 | 501.4 | 0.059 | 1,002.0 | 0.048 | 1,502.5 | 0.157 | 2,003.0 |
iHOD random | 30.04 | 124.4 | 56.23 | 197.7 | 28.54 | 209.8 | 20.85 | 223.2 |
iHOD psm | 15.17 | 137.0 | 24.59 | 165.7 | 46.41 | 197.6 | 10.19 | 211.0 |
As other popular outlier detection algorithms like ORCA [
As described above, the HOD and iHOD algorithms detect the next outlier after excluding those already detected, so that they can detect more true outliers. Is it possible, then, that normal objects are mistakenly detected as outliers?
To analyze easily, we let
The expected outlier degree of
After taking out
The variation of
For ease of analysis, we can set
So that
Because
Obviously, from formula (
Since
As a result,
In other words, the variation for an outlier is larger than that for a normal object, so the outlier is more easily detected. However, owing to the complexity of dataset distributions and the limitations of distance-based outlier definitions, it remains possible that a few normal objects are mistakenly detected.
In this paper, we propose a new distance-based outlier definition to detect hidden outliers and design a corresponding detection algorithm, which excludes detected outliers to reduce their influence and improve accuracy. The candidate-set-based method also avoids repeatedly scanning the whole dataset. Experimental results show that our HOD algorithm is more accurate than ORCA [
Our definition and algorithm are extended from two distance-based outlier detection algorithms: ORCA [
Hidden outlier definition; see details in Definition
Dataset
Distance function
An object of dataset
Number of outliers to detect
Number of neighbors for calculating outlier degree
The
The
Outlier degree of
The set of top-
The
The set of top-
Maximum possible outlier degree of
Hidden outlier candidate set, each object of which has larger maximum possible outlier degree than cutoff value
Cutoff value, equal to the outlier degree of current
A data block, that is, a set of objects
Index, for example,
Index of a data block
Number of segments/partitions
Number of pivots. In this paper, it can only be set as
The authors declare that there are no competing interests regarding the publication of this paper.
Dr. Minhua Lu is the corresponding author. This research was supported by the following Grants: China 863: 2015AA015305; NSF-China: U1301252 and 61471243; Guangdong Key Laboratory Project: 2012A061400024; NSF-Shenzhen: JCYJ20140418095735561, JCYJ20150731160834611, JCYJ20150625101524056, and SGLH20131010163759789; Educational Commission of Guangdong Province: 2015KQNCX143.