Locality-Based Visual Outlier Detection Algorithm for Time Series

Physiological theories indicate that extreme values leave the deepest impression of time series data on the human visual system. Based on this principle, and by studying extreme-point-based hierarchy segmentation strategies, a hierarchy-segmentation-based data extraction method for time series, and the idea of local outliers, a novel outlier detection model and method for time series are proposed. The presented algorithm assigns an outlier factor to each subsequence of a time series, which makes visual outlier detection relatively direct. The experimental results show that, on average, the developed method outperforms the compared methods and reduces time series data efficiently, which indicates the promising performance of the proposed method and its practical application value.


Introduction
Time series, which exist widely in various applications [1] such as sensor network data collection [2][3][4], credit card fraud data [1], and environment monitoring data [2,5,6,7,8], are one of the major types of big data. A time series is an ordered sequence of data observed over time; it is highly intuitive, and most of the desired key information can usually be obtained directly from its variations or distributions via the human visual system. Moreover, physiological experiments have demonstrated that the extreme values of sequence data leave the deepest impression on the human visual system [9,10], which inspired us to study a visual outlier detection method for outlier events based on this principle.
Generally, there are three types of outliers: collective outliers, point outliers, and contextual outliers [8]. Identification of outliers can lead to the discovery of significant clues and has practical applications in various fields, such as financial risk management [1,10], anomaly detection [5], and disaster alarm in environment monitoring [2,5,6,7].
In the past few decades, this issue has been addressed in academia and has attracted an increasing amount of attention. Some outlier detection approaches are based on notably different assumptions, intuitions, and models and also differ substantially in the scaling, range, and even meaning of their output values [11]. Other methods are built on specific technologies, such as the cluster-based detection method [4], the immunology-based detection method [12], and the SVM-based detection method [8]. Regardless of the type of time series, many valuable characteristics exist at most locations, such as the locality features neighboring a real outlier, and such local characteristics may be more meaningful than the global information. For example, when a doctor diagnoses a disease based on an electrocardiogram (ECG), the ECG's local information is enough for finding the lesion. However, most of the aforementioned methods are unable to detect the outliers in time series locally and visually.
Although most previous studies [1][2][3][4][5][6][7][8] have addressed outlier detection in time series, some challenges remain: for example, different time series may be out of synchronism, so that traditional similarity calculations no longer give valid results; periodic outliers in time series are hard to detect; and the outlier threshold is often determined unreasonably. In this paper, an outlier detection method based on hierarchy-segmentation data extraction is proposed. To achieve relatively high effectiveness and efficiency, our scheme integrates the following: (a) the extreme-point discriminating strategy based on hierarchy segmentation; (b) the hierarchy-segmentation-based data extraction (HSDE) method for time series; (c) the outlier detection model; and (d) the locality outlier detection algorithm. Regarding outlier identification, unlike previous attempts to solve this problem, the proposed method relies on the departure of an object from its expected hierarchy rather than from its global structure. Additionally, being labeled as an "outlier" here is not an either/or proposition. Instead, the proposed method assigns a local outlier factor to each detected subsequence, and the factor is the degree to which the object is outlying. Our major contributions are detailed as follows.
(1) The relation between the distribution characteristic in time series and the recognition mechanism associated with the human visual system is addressed, and the HSDE-based visual outliers detection method distinguishes the outliers directly without requiring previously observed training data.
(2) The locality-based outlier detection idea is successfully transferred into the realization for data mining of time series; in contrast, the previous LOF algorithms are only applicable to numerical data.
(3) A novel hierarchy-segmentation-based data extraction method for time series and its associated outlier detection model are presented.
The remainder of this paper is organized as follows. The related works are introduced in Section 2. In Section 3, we describe the new hierarchy-segmentation-based strategy and the related data extraction method. In Section 4, we improve the key ideas of the LOF algorithm and derive the framework of the HSDE-based outlier detection model and algorithm. Promising experimental results on benchmarking datasets are presented in Section 5, followed by the concluding remarks in Section 6.

Related Works
A wide variety of studies have investigated outlier detection, and various kinds of outlier detection methods, such as global versus local, scoring versus labeling, and supervised versus unsupervised, have been proposed [13]. Most of them are developed from different ideas for identifying outliers, such as similarity or dissimilarity measurement. Due to the specificity of time series, only a small portion of these methods can detect outliers in time series.
As to distance-based outlier detection in time series, there are four main dissimilarity measurements, namely, Euclidean distance (ED), dynamic time warping (DTW), symbolic aggregate approximation (SAX), and extended symbolic aggregate approximation (Extended-SAX), together with their derived outlier detection schemes. The outlier detection methods developed from these four measurements inherit the respective advantages and disadvantages. ED is well known for its simple computation and sound universality, but it can only handle time series of equal length and cannot recognize the variation trend of a time series [13,14]. DTW overcomes the first disadvantage of ED and supports time warping of time series. However, its computational cost is high, which limits its application range. Chiu et al. [15] proposed the symbolic aggregate approximation (SAX) approach. SAX first symbolizes the time series and then measures the similarity of the symbolic data. This method is easy to use and independent of specific experimental data. With relatively strong universality, the approach has been widely used [16][17][18]. However, the similarity measure in SAX is essentially based on ED or DTW, so it inevitably inherits their disadvantages.
Naess and Gaidai [9] developed a feature space-based outlier detection method based on SAX. This method can effectively reduce the number of features and compress the scale of the time series. However, it easily misses some important features in the process of reduction, and it is unable to detect the outliers in time series visually. Extended symbolic aggregate approximation (Extended-SAX) [19] was developed from SAX, and an outlier detection method based on it was also presented. Extended-SAX depends on the piecewise aggregate approximation (PAA) representation for dimensionality reduction, which reduces dimensionality by taking the mean values of equal-sized subsequences. The final distance measurement in Extended-SAX also depends on ED or DTW. Furthermore, PAA adds to the computational cost. The outlier detection method based on Extended-SAX is likewise unable to detect the outliers in time series visually. Moreover, all of the above methods realize outlier detection through the so-called "distance measurement" rather than through the locality distribution characteristics of time series.
This paper also uses DTW as the dissimilarity measurement. The HSDE-based outlier detection scheme is also inspired by the local outlier factor (LOF) strategy [19] and its incremental LOF algorithm [20]: we address collective outlier detection with DTW-based methods and aim to enumerate the desired outliers in time series visually via the locality distribution characteristics of the data points. In particular, the outliers are enumerated visually so that they can be detected by the human visual system. Finally, comparison studies are performed with the feature space-based outlier detection method [9] and the Extended-SAX-based outlier detection method [19], and the analysis results are presented.
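To illustrate why DTW can compare series of unequal length where ED cannot, the following minimal Python sketch computes the classic DTW distance by dynamic programming. The function name `dtw_distance` and the absolute-difference local cost are assumptions made for this example; the text does not specify which DTW variant is used.

```python
import math


def dtw_distance(a, b):
    """Classic dynamic-programming DTW with an absolute-difference local cost.

    Illustrative sketch only; the local cost and the lack of a warping
    window are assumptions, not taken from the paper.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a[i-1] matched again
                                 cost[i][j - 1],      # b[j-1] matched again
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]
```

The table has (n + 1)(m + 1) cells, which is the quadratic cost that the text refers to when it notes that DTW is expensive.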
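The PAA step described above, which replaces each equal-sized subsequence by its mean, can be sketched in a few lines of Python. The function name `paa` and the requirement that the series length be a multiple of the segment count are assumptions made to keep the example short.

```python
def paa(series, segments):
    """Piecewise aggregate approximation: the mean of each equal-sized chunk.

    Illustrative sketch; assumes len(series) is a positive multiple of
    `segments` so that all chunks have the same width.
    """
    n = len(series)
    if segments <= 0 or n == 0 or n % segments != 0:
        raise ValueError("series length must be a positive multiple of segments")
    width = n // segments
    return [sum(series[i:i + width]) / width for i in range(0, n, width)]
```

For instance, `paa([1, 3, 5, 7, 2, 4], 3)` reduces six points to the three means of consecutive pairs, which also shows how features inside a segment can be averaged away.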
Here, the "hierarchy value" describes the importance level of a data point in the time series: the larger the hierarchy value, the higher the importance of the data point. Therefore, the hierarchy value is used throughout to represent the importance level of a data point in the time series.
Based on the hierarchy of the different data points in a time series, the hierarchy-segmentation-based data extraction (HSDE) for time series is proposed, which includes the following stages: extreme-point discriminating (EPD), hierarchy marking of time series (HM), and hierarchy segmentation series accessing (HSSA).
(2) Hierarchy Marking of Time Series. The hierarchy marking of time series (HM) function is discussed in this section. The EPD function is utilized to discriminate extreme points. The pseudocode of HM can be expressed in the diagram below. Because the hierarchy value is always a positive integer, a predetermined parameter Max is defined as its upper bound. After HM processing is done, the hierarchy values in the obtained HM correspond one-to-one to the data points of the time series.
(3) Hierarchy Segmentation Series Accessing. The hierarchy segmentation series accessing (HSSA) function follows the original time order of the time series and selects the data points whose mark in HM satisfies |HM(·)| = Max. The selected data points are reconstructed as a new hierarchy segmentation series (HSS). The pseudocode of the HSSA function is expressed in Pseudocode 3.
In fact, after HSSA processing is done, the HSS corresponds to the original time series after data reduction, while attempting to maintain as much key information as possible.

Hierarchy-Segmentation-Based Data Extraction.
In fact, the number of data points in the newly obtained series HSS is far smaller than that in the original time series. However, the key information is likely to remain largely unchanged before and after HSSA processing. Therefore, data compression is conducted at the same time. It is worth noting that the reduced series HSS can successfully represent the original time series only if the hierarchy value in HM is properly selected. We call this procedure the hierarchy-segmentation-based data extraction (HSDE) method.
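Since the EPD and HM pseudocode is not reproduced in this text, the following Python sketch shows only one plausible reading of the HSDE idea and should not be taken as the authors' algorithm. It assumes that level-1 extreme points are strict local maxima or minima of the series, that a level-h maximum (minimum) is a local maximum (minimum) among the level-(h-1) maxima (minima), and that marks are signed (positive for maxima, negative for minima). The names `local_peaks`, `hierarchy_marks`, and `extract` are hypothetical.

```python
def local_peaks(idx, series, sign):
    """Indices in `idx` that are strict local maxima of sign*series,
    comparing each candidate only with its neighbours inside `idx`."""
    out = []
    for pos in range(1, len(idx) - 1):
        prev, cur, nxt = (sign * series[idx[pos + d]] for d in (-1, 0, 1))
        if cur > prev and cur > nxt:
            out.append(idx[pos])
    return out


def hierarchy_marks(series, max_level):
    """Hypothetical HM: mark[i] = +h (maximum) or -h (minimum), where h is
    the highest level (up to max_level) at which point i is still extreme."""
    marks = [0] * len(series)
    maxima = minima = list(range(len(series)))
    for level in range(1, max_level + 1):
        maxima = local_peaks(maxima, series, +1)
        minima = local_peaks(minima, series, -1)
        for i in maxima:
            marks[i] = level
        for i in minima:
            marks[i] = -level
    return marks


def extract(series, max_level):
    """Hypothetical HSSA: keep, in time order, the points with |mark| == max_level."""
    marks = hierarchy_marks(series, max_level)
    return [(i, series[i]) for i, m in enumerate(marks) if abs(m) == max_level]
```

Under these assumptions, raising `max_level` keeps fewer and more prominent extreme points, which mirrors the statement that the reduced series is representative only when the hierarchy value is chosen properly.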

The Local Outlier Factor and Detection Principle.
In this section, our goal is to evaluate the practical application value of HSDE-based methods. Inspired by the method developed in [19,20], we extend the main idea of the local outlier factor (LOF) to data mining of time series, wherein the LOF is a local degree that depends on how isolated an object is with respect to its surrounding neighborhood. Our final goal is to assign an outlier factor (the degree to which the object is outlying) to each subsequence in the time series. This paper improves some key steps of the previous algorithms [19,20] and retains some of the same local outlier detection notions, such as the $k$-distance of an object $p$, the $k$-distance neighborhood of an object, and the reachability distance of an object $p$ with respect to an object $o$ [19,20]. The distance in the $k$-distance of an object $p$ is redefined as $\mathrm{DTW}(p, o)$ between $p$ and an object $o$ such that (1) for at least $k$ objects $o'$ it holds that $\mathrm{DTW}(p, o') \le \mathrm{DTW}(p, o)$ and (2) for at most $k - 1$ objects $o'$ it holds that $\mathrm{DTW}(p, o') < \mathrm{DTW}(p, o)$. Here, $k$ is a positive integer which represents the number of objects and must be predetermined by experimentation.
Additionally, the $k$-distance neighborhood of $p$ contains each object whose distance from $p$ is not greater than the $k$-distance; that is,

$$N_k(p) = \{\, q \ne p \mid \mathrm{DTW}(p, q) \le k\text{-distance}(p) \,\}. \qquad (1)$$

These objects $q$ are called the $k$-nearest neighbors of $p$. The reachability distance of object $p$ with respect to object $o$ is defined as

$$\text{reach-dist}_k(p, o) = \max\{\, k\text{-distance}(o),\ \mathrm{DTW}(p, o) \,\}. \qquad (2)$$

The set of the reachability distances of an object $p$ is denoted as $\text{reach-dist}_k(p)$. The smaller the value of $|\text{reach-dist}_k(p)|$, the fewer objects $o$ lie within the reachability distance of $p$. In contrast, a larger value indicates that the object $p$ has more neighbors and also falls within the reachability distance of more other objects. For example, given a temporary time series, which is illustrated in Figure 1, it is clear that object 1 is not located inside any other 2-distance neighborhood and is far away from the others. By contrast, object 4 falls inside the reachability distance of the others.
Further, to make the main principles developed in the algorithms [19,20] suitable for handling time series, we define two additional important notions, the local reachability density of an object $p$ and the local outlier factor of an object $p$, as given in formulae (3) and (4), respectively.
Definition 3. The local reachability density $\mathrm{lrd}_k(p)$ of an object $p$ is defined by formula (3), where $\mathrm{lrd}_k(p)$ is a level of the local density of the object $p$: it is the ratio of the number of the reachability distances of the object $p$ to the total sum of the reachability distances of the objects. The definition is subject to $0 \le \mathrm{lrd}_k(p) \le 1$.
Definition 4. The local outlier factor is defined by formula (4). The local outlier factor (LOF) of each subsequence in the dataset is computed by formula (4), and the factors are sorted in either ascending or descending order. As a result, the range of the outlier factor over the subsequences is clear.
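As a rough illustration of how an outlier factor can be assigned to every object, the Python sketch below implements the standard LOF computation of [19,20] on a precomputed distance matrix (for time series, the entries would be DTW distances between subsequences). It follows the original LOF definitions, which is an assumption: the improved formulae (3) and (4) of this paper may differ. The function name `lof_scores` is hypothetical, and the sketch does not handle duplicate objects whose reachability distances sum to zero.

```python
def lof_scores(dist, k):
    """Standard LOF from a symmetric distance matrix `dist` (list of lists).

    Returns one score per object; values near 1 indicate inliers and
    clearly larger values indicate local outliers.
    """
    n = len(dist)
    k_dist, neigh = [], []
    for p in range(n):
        ranked = sorted((dist[p][o], o) for o in range(n) if o != p)
        kd = ranked[k - 1][0]                            # k-distance of p
        k_dist.append(kd)
        neigh.append([o for d, o in ranked if d <= kd])  # ties are included

    def reach_dist(p, o):
        return max(k_dist[o], dist[p][o])

    # local reachability density: neighbour count over summed reach distances
    lrd = [len(neigh[p]) / sum(reach_dist(p, o) for o in neigh[p])
           for p in range(n)]
    # LOF: average ratio of the neighbours' density to the object's own density
    return [sum(lrd[o] for o in neigh[p]) / (len(neigh[p]) * lrd[p])
            for p in range(n)]


# Toy usage with scalar "objects"; real use would fill `dist` with DTW values.
points = [0, 1, 2, 3, 10]
dist = [[abs(a - b) for b in points] for a in points]
scores = lof_scores(dist, k=2)
```

In this toy example, the isolated value 10 receives by far the largest score, while the four clustered values score close to 1, so sorting the scores makes the outlier stand out directly.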

The Outlier Detection Model.
Based on the above studies, this paper presents an outlier detection model for time series, which is shown in Figure 2. The outlier detection process mainly includes the following stages: the hierarchy-segmentation-based data extraction (HSDE) for time series, the computation of the $k$-distance, the computation of the reachability distance neighborhoods and the local reachability density, the computation of the outlier factor, and the labeling of the outlier sequence. Each stage is conducted strictly according to the aforementioned details.

The Proposed Method.
Based on the proposed model in Section 4.2, the HSDE-based outlier detection method is summarized as shown in Pseudocode 4.
Here, the computation of the hierarchy-segmentation-based data extraction (HSDE) for time series requires the most time; its main cost is the double loop in the HM function, whose time complexity is $O(\log n \cdot \mathrm{Max})$. The other stages of the HSDE-based outlier detection algorithm, which are conducted sequentially, have a similar time complexity of no more than $O(\log n \cdot \mathrm{Max})$. Therefore, the total time complexity of the HSDE-based outlier detection algorithm is $O(\log n \cdot \mathrm{Max})$.

Experiment Arrangement.
We arrange several experiments on three datasets, namely, Keogh Data [21], ECG Data [22], and Ma Data [23]. The experiments aim to validate both the detection capability of the proposed method and its effectiveness and efficiency. All of the experiments are implemented in Matlab R2010b.

Evaluation Indices.
This study adopts the traditional indices [24], namely, the false positive rate and the false negative rate, which are redefined as formulae (5) and (6), respectively:

$$\mathrm{FalsePositive} = \frac{FP}{TP + FP}, \qquad (5)$$

$$\mathrm{FalseNegative} = \frac{FN}{TP + FN}, \qquad (6)$$

where TP, FP, FN, and TN are explained in Table 1. The false positive rate (FalsePositive) denotes the ratio of the number of normal items wrongly recognized as outliers to the total number of detected outliers, as formalized in formula (5); a smaller false positive rate indicates a higher outlier detection performance. The false negative rate (FalseNegative) is the ratio of the number of missed outliers to the total number of real outliers, as formalized in formula (6); a lower rate implies a higher detection accuracy and efficiency.
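The two indices can be computed directly from the sets of detected and real outlier subsequences. The Python sketch below expresses them as percentages; the function name `detection_rates`, the percentage scaling, and the convention of returning 0 when a denominator is empty are assumptions made for this example.

```python
def detection_rates(detected, real):
    """Return (false positive rate, false negative rate) in percent.

    false positive rate = FP / (TP + FP): wrongly flagged items over all flagged
    false negative rate = FN / (TP + FN): missed outliers over all real outliers
    An empty denominator yields 0.0 by convention (an assumption of this sketch).
    """
    detected, real = set(detected), set(real)
    tp = len(detected & real)   # real outliers that were flagged
    fp = len(detected - real)   # normal items flagged as outliers
    fn = len(real - detected)   # real outliers that were missed
    false_positive = 100.0 * fp / (tp + fp) if detected else 0.0
    false_negative = 100.0 * fn / (tp + fn) if real else 0.0
    return false_positive, false_negative
```

With only one or two real outliers, as in the experiments below, a single miss or false alarm moves these values in large steps, which is why the reported indices tend to be extremely high or low.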

Result Analysis.
Three benchmarking time series datasets, Keogh Data [21], ECG Data [22], and Ma Data [23], are employed in the experiments. The feature space-based method [9], the Extended-SAX-based method [19], and the method proposed in this paper are compared in terms of the evaluation indices, with the best parameters for each method. We compared all three approaches under the same conditions: (1) the first dataset is the training data, with several slightly noisy data points; (2) the second is a time series containing a synthetic "outlier," which was created with the same parameters as the training subsequence [25]; (3) to guarantee the fairness of the comparison results, the time series datasets are user-partitioned into equal subsequences to highlight the outliers and reduce the complexity of data processing; and (4) the best parameters of each method are selected through several training experiments.
Experiment 1 (Keogh Data). Keogh Data [21] is the experimental time series by Keogh, which is generated by several randomized procedures and has a length of 800, and to which additive Gaussian noise with a mean of 0 and a standard deviation of 1 is added. In addition, outlier events exist in the range between the 400th and the 432nd data points, so that the outlier data points are concentrated.
Here, the 800 data points are separated into 20 subsequences, each of which is 40 data points long. These 20 subsequences form a new time series dataset, indexed from 1 to 20, which is used for the experiment. Since the outlier events lie in the range between the 400th and 432nd data points, the corresponding 11th subsequence is the real outlier. In the feature space-based outlier detection method, the number of subsequences is 20; $k$ in the $k$-distance is 6; the parameter of the Extended-SAX-based outlier detection method is 4; and Max in the proposed outlier detection method is 2. The experimental results are shown in Figure 3 and Table 2, wherein the threshold value is a user-predefined parameter based on several empirical observations. Figure 3(a) shows the generalized 6-distance neighborhoods of each subsequence obtained by the feature space-based outlier detection method. The value of subsequence 19 is the maximum, the value of subsequence 11 is relatively smaller, and the value of subsequence 5 is the minimum. Figure 3(b) shows the generalized outlier factor of each subsequence obtained by the Extended-SAX-based outlier detection method. The values of subsequences 3 and 7 are nearly equal and are the maximum; that is, the outlier factors of subsequences 3 and 7 are the largest. In contrast, the accumulated distance of subsequence 11 is relatively smaller and less prominent; it is neither the maximum nor the minimum. Figure 3(c) shows the LOF of each subsequence obtained by the proposed outlier detection method. The LOF of subsequence 11 is the highest, which is consistent with the real time series.
On the other hand, an experimental comparison is shown in Table 2. The comprehensive performance of the proposed method is superior to that of the compared methods. In Table 2, the total number of real outliers is small, regardless of whether they are detected or not, which causes the evaluation indices FalseNegative and FalsePositive to be extremely high or low according to their definitions. Comparatively, the proposed method stands out.
According to the above findings, the generalized 6-distance neighborhoods method and the Extended-SAX-based outlier detection method are unable to find the outlier subsequence. The generalized 6-distance neighborhoods method raised false alarms of approximately equal magnitude for subsequences 3 and 19, and the Extended-SAX-based outlier detection method raised false alarms of approximately equal magnitude for subsequences 3 and 7. Unlike the other two approaches, the proposed outlier detection method shows a strong peak in the range of the outlier subsequence, as it successfully detected the outlier subsequence 11. Although subsequences 3, 7, and 19 are not real outliers, the proposed outlier detection method also assigns subsequences 3 and 19 a relatively high outlier "level," but no higher than that of the real outlier subsequence 11. This indicates that the proposed outlier detection algorithm might have practical application value. Although Figure 3 shows only the results at "2-distance," similar results may be observed at other hierarchies, and some outlier patterns might exist at a different "hierarchy."
Experiment 2 (ECG Data). ECG Data [22] is a time series dataset with 3570 data points, in which outlier events exist in the range between the 2300th and 2500th data points. Here, the ECG data are separated into 25 subsequences in order to highlight the outlier data points, and each subsequence is 150 data points long. These 25 subsequences form a new time series dataset, indexed from 1 to 25, which is used in the experiment. In terms of the real outliers in ECG Data, the 16th and 17th subsequences are the outliers. We compared all three methods under consideration. In the feature space-based method, the segmentation number is 50; $k$ in the $k$-distance is 8; the parameter of the Extended-SAX-based outlier detection method is 4; and Max in the proposed method is 4.
The experimental results are shown in Figure 4 and Table 3. In Figure 4, the threshold value is a user-predefined parameter based on empirical observation.

The feature space-based method: 100, 100
The extended SAX-based method: 100, 100
The proposed method: 0, 0

According to the above discussion, the feature space-based outlier detection method is unable to find the outliers at all, while raising false alarms of approximately equal magnitude for several subsequences. The Extended-SAX-based outlier detection method found only one of the two real outlier subsequences and raised a false alarm for subsequence 19. Unlike the other two approaches, the proposed outlier detection method shows a strong peak in the range of the real outlier data points, successfully detecting the outlier subsequences 16 and 17. Regrettably, the proposed outlier detection method also assigns the normal subsequence 24 a relatively high LOF value, although it is no higher than that of the outliers. In essence, as seen from Figure 4(c), the corresponding level of subsequence 24 is of the same magnitude as that of the other normal data points, without any extreme behavior. Our analysis indicates that this is caused by the empirically chosen parameter Max in the proposed method: the subsequences are separated and marked according to Max. This results in a local outlier, which is outlying only in its neighborhood rather than in the global time series.
Additionally, the experimental comparison is shown in Table 3. The comprehensive performance of the proposed method is superior to that of the compared methods. In Table 3, for similar reasons, the number of real outliers is small, which results in extremely high or low values of the evaluation indices FalseNegative and FalsePositive. This does not affect the conclusion that the proposed method has a relatively stronger outlier detection capability.
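The mapping from a range of data points to the equal-length subsequences that contain it can be checked with a short Python sketch. The function names `partition` and `covering_subsequences` are hypothetical, and the sketch assumes 1-indexed, inclusive data-point positions and non-overlapping subsequences.

```python
def partition(series, length):
    """Split a series into consecutive, non-overlapping chunks of `length`
    points (the last chunk may be shorter)."""
    return [series[i:i + length] for i in range(0, len(series), length)]


def covering_subsequences(first, last, length):
    """1-indexed numbers of the subsequences touched by data points
    first..last (1-indexed, inclusive)."""
    start = (first - 1) // length + 1
    stop = (last - 1) // length + 1
    return list(range(start, stop + 1))
```

For the ECG setting described above, `covering_subsequences(2300, 2500, 150)` gives the 16th and 17th subsequences, in agreement with the stated ground truth.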
Experiment 3 (Ma Data). Ma Data [23] includes three synthetic time series, each generated from a user-predefined stochastic process and each with 1200 data points; the first series follows the normal distribution and contains no outliers, and the second and third series carry additive Gaussian noise with zero mean and a standard deviation of 0.1. The outlier event lies in the range [600-620] of the second series, and the outlier events are in the ...

The feature space-based method: 100, 100
The extended SAX-based method: 100, 100
The proposed method: 0, 66.7

The feature space-based method: 50, 60
The extended SAX-based method: 25, 50
The proposed method: 33.3, 33.3

The advantages of the proposed method can be summarized as follows:
(1) The HSDE-based visual outlier detection method does not require previously observed normal data.
(2) The HSDE-based visual outlier detection method can find outliers by enumerating all of the outlier subsequences and can even determine the final outliers intuitively.
(3) It is more practical to assign a factor of being an outlier to each hierarchy of the different subsequences in time series, so that the outlier can be detected directly.
(4) The proposed method visually enumerates the outlier subsequence in time series based on its outlier factor.
(5) The results directly present strong visual evidence for monitoring outliers without any data converting.
However, the proposed method requires further study and improvement: for example, how to let the algorithm itself determine the outlier threshold, how to lower the relatively high false alarm ratio, how to handle "each point" in a time series, and how to use sliding window technology to separate the time series instead of user-conducted separation. These issues will be investigated in succeeding studies.

Figure 3: Experimental results on Keogh Data.

Figure 4: (a) The results using the feature space-based outlier detection method. (b) The results using the extended SAX-based outlier detection method. (c) The results using the proposed outlier detection method.

Figure 4(a) shows the generalized 8-distance neighborhoods of the time series obtained by the feature space-based outlier detection method. The 8-distance values of subsequences 16 and 17 are neither the maximum nor the minimum ones. In contrast, the 8-distance value of subsequence 23 is relatively larger, but in fact subsequence 23 is not a real outlier. Figure 4(b) shows the generalized outlier factor of each subsequence obtained by the Extended-SAX-based outlier detection method. The accumulated distance value of subsequence 17 is the maximum and that of subsequence 19 is the second largest, whereas the accumulated distance value of subsequence 16 is relatively smaller and less prominent. Namely, the other outlier, subsequence 16, has not been found. Figure 4(c) shows the generalized LOF of each subsequence obtained by the proposed outlier detection method. The LOFs of subsequences 16 and 17 are larger than those of the others, which is consistent with the real time series.

Table 1: The detected results.

Table 2: The experimental comparison using Keogh Data.

Table 3: The experimental comparison using ECG Data.

Table 4: The experimental comparison using the second series of Ma Data.

Table 5: The experimental comparison using the third series of Ma Data.