A Distributed Algorithm for the Cluster-Based Outlier Detection Using Unsupervised Extreme Learning Machines

Outlier detection is an important data mining task whose target is to find the abnormal or atypical objects in a given dataset. Techniques for detecting outliers have many applications, such as credit card fraud detection and environment monitoring. Our previous work proposed the Cluster-Based (CB) outlier and gave a centralized method using unsupervised extreme learning machines to compute CB outliers. In this paper, we propose a new distributed algorithm for CB outlier detection (DACB). On the master node, we collect a small number of points from the slave nodes to obtain a threshold. On each slave node, we design a new filtering method that uses the threshold to speed up the computation. Furthermore, we also propose a ranking method to optimize the order of cluster scanning. Finally, the effectiveness and efficiency of the proposed approaches are verified through a series of simulation experiments.


Introduction
Outlier detection is an important issue in data mining, and it has been widely studied by many scholars for years. According to the description in [1], "an outlier is an observation in a dataset which appears to be inconsistent with the remainder of that set of data." The techniques for mining outliers can be applied to many fields, such as credit card fraud detection, network intrusion detection, and environment monitoring.
There exist two primary tasks in outlier detection. First, we need to define which data are considered outliers in a given set. Second, an efficient method to compute these outliers needs to be designed. The outlier problem was first studied by the statistics community [2,3]. They assume that the given dataset follows a distribution, and an object is considered an outlier if it shows distinct deviation from this distribution. However, it is almost impossible to find an appropriate distribution for high-dimensional data. To overcome this drawback, some model-free approaches were proposed by the data management community. Examples include distance-based outliers [4][5][6] and the density-based outlier [7]. Although these definitions do not need any assumption on the dataset, some shortcomings still exist in them. Therefore, in this paper, we propose a new definition, the Cluster-Based (CB) outlier. The following example discusses the weaknesses of the existing model-free approaches and the motivation of our work.
As Figure 1 shows, there are a dense cluster C1 and a sparse cluster C2 in a two-dimensional dataset. Intuitively, points p1 and p2 are the outliers because they show obvious differences from the other points. However, in the definitions of the distance-based outliers, a point p is marked as an outlier depending on the distances from p to its k-nearest neighbors. Then, most of the points in the sparse cluster C2 are more likely to be outliers, whereas the real outlier p1 will be missed. In fact, we must consider the locality of outliers. In other words, to determine whether a point p in a cluster C is an outlier, we should only consider the points in C, since the points in the same cluster usually have similar characteristics. Therefore, in Figure 1, p1 from C1 and p2 from C2 can be selected correctly. The density-based outlier [7] also considers the locality of outliers. For each point p, it uses the Local Outlier Factor (LOF) to measure the degree of being an outlier. To compute the LOF of p, we need to find the set of its k-nearest neighbors N_k(p) and all the k-nearest neighbors of each point in N_k(p). The expensive computational cost limits the practicability of the density-based outlier. Therefore, we propose the CB outlier to overcome the above deficiencies. The formal definition is given in Section 3.
To detect CB outliers in a given set, the data need to be clustered first. In this paper, we employ the unsupervised extreme learning machine (UELM) [8] for clustering. The extreme learning machine (ELM) is a technique proposed by Huang et al. [9][10][11] for pattern classification, which shows better predictive accuracy than the traditional Support Vector Machines (SVMs) [12][13][14][15]. Thus far, ELM techniques have attracted the attention of many scholars, and various extensions of ELM have been proposed [16]. UELM [8] is designed for dealing with unlabeled data, and it can efficiently handle clustering tasks. The authors show that UELM provides favorable performance compared with state-of-the-art clustering algorithms [17][18][19][20].
In [21], we studied the problem of CB outlier detection using UELM in a centralized environment. Faced with the increasing data scale, the performance of the centralized method becomes too limited to meet timeliness requirements. Therefore, in this paper, we develop a new efficient distributed algorithm for CB outlier detection (DACB). The main contributions are summarized as follows: (1) We propose a framework of distributed CB outlier detection, which adopts a master-slave architecture.
The master node keeps monitoring the points with large weights on each slave node and obtains a threshold θ. The slave nodes can use θ to efficiently filter unpromising points and accelerate the computation.
(2) We propose a new algorithm to compute the point weights on each slave node. Compared with our previous method in [21], the new algorithm adopts a filtering technique to further improve efficiency: a large number of unpromising points can be filtered out instead of having their exact weights computed.
(3) We propose a new method to optimize the order of cluster scanning on each slave node. With the help of this method, we can obtain a large θ early and improve the filtering performance.
The rest of the paper is organized as follows. Section 2 gives brief overviews of ELM and UELM. Section 3 formally defines the CB outlier. Section 4 gives the framework of DACB. Section 5 illustrates the details of DACB. Section 6 analyzes the experimental results. Section 7 reviews related work on outlier detection. Section 8 concludes the paper.

Brief Introduction to ELM.
The target of ELM is to train a single hidden layer feedforward network from a training set with N samples, {X, Y} = {x_i, y_i}, i = 1, ..., N. Here, x_i ∈ R^d, and y_i is an m-dimensional binary vector in which only one entry is "1", indicating the class that x_i belongs to.
The training process of ELM includes two stages. In the first stage, we build the hidden layer with L nodes using a number of mapping neurons. In detail, for the ith hidden layer node, a d-dimensional vector a_i and a parameter b_i are randomly generated. Then, for each input vector x_j, the relevant output value on the ith hidden layer node can be acquired using an activation function such as the Sigmoid function below:

    g(x_j; a_i, b_i) = 1 / (1 + exp(−(a_i · x_j + b_i))).    (1)

Then, the matrix output by the hidden layer is

    H = [ g(x_1; a_1, b_1) ⋯ g(x_1; a_L, b_L)
          ⋮                    ⋮
          g(x_N; a_1, b_1) ⋯ g(x_N; a_L, b_L) ]  (N × L).    (2)

In the second stage, an m-dimensional vector β_i is the output weight that connects the ith hidden layer node with the output nodes. The output matrix Y is acquired by

    Y = Hβ,    (3)

where

    β = [β_1; β_2; ...; β_L]  (L × m).    (4)
Given the matrices H and Y, the target of ELM is to solve the output weights β by minimizing the square losses of the prediction errors, leading to the following equation:

    min_β (1/2)‖β‖² + (C/2)‖Y − Hβ‖²,    (5)

where C is a penalty coefficient on the training errors. If N ≥ L, which means H has more rows than columns and is of full column rank, (6) is the solution for (5):

    β = (I_L/C + HᵀH)⁻¹ HᵀY.    (6)

If N < L, a restriction that β is a linear combination of the rows of H, β = Hᵀα (α ∈ R^{N×m}), is considered. Then, β can be calculated by

    β = Hᵀ (I_N/C + HHᵀ)⁻¹ Y,    (7)

where I_L and I_N are the identity matrices of dimensions L and N, respectively.
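The two-stage training above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation; the hidden-layer size `L`, the penalty `C`, and the random seed are assumed hyperparameters.

```python
import numpy as np

def elm_train(X, Y, L=50, C=1.0, rng=None):
    """Train a basic regularized ELM: random hidden layer, closed-form output weights."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    A = rng.normal(size=(d, L))                 # random input weights a_i
    b = rng.normal(size=L)                      # random biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))      # Sigmoid hidden-layer output, N x L
    N = X.shape[0]
    if N >= L:                                  # solution (6)
        beta = np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ Y)
    else:                                       # solution (7)
        beta = H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, Y)
    return A, b, beta

def elm_predict(X, A, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))
    return H @ beta
```

The predicted class of a sample is the index of the largest entry of the corresponding output row.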

Unsupervised ELM.
Huang et al. [8] proposed UELM to process unsupervised datasets, and it shows good performance in clustering tasks. The unsupervised learning is based on the following assumption: if two points x_1 and x_2 are close to each other, their conditional probabilities P(y | x_1) and P(y | x_2) should be similar. To enforce this assumption on the data, we minimize the following term:

    L_m = (1/2) Σ_{i,j} w_ij ‖P(y | x_i) − P(y | x_j)‖²,    (8)

where w_ij is the pairwise similarity between x_i and x_j, which can be calculated by the Gaussian function exp(−‖x_i − x_j‖²/2σ²). Since it is difficult to calculate the conditional probabilities, the following can approximate (8):

    L̂_m = Tr(Ŷᵀ L Ŷ),    (9)

where Tr(⋅) denotes the trace of a matrix, Ŷ is the prediction for the unlabeled dataset, L = D − W is known as the graph Laplacian, and D is a diagonal matrix with diagonal elements d_ii = Σ_j w_ij. In unsupervised learning, the dataset X = {x_i}, i = 1, ..., N, is unlabeled. Substituting (9) into (5), the objective function of UELM is acquired:

    min_β ‖β‖² + λ Tr(βᵀHᵀLHβ),    (10)

where λ is a tradeoff parameter. In most cases, (10) reaches its minimum value at β = 0. In [18], Belkin and Niyogi introduced an additional constraint (Hβ)ᵀHβ = I_m. Based on the conclusion in [8], if m ≤ L, we can obtain the following generalized eigenvalue problem:

    (I_L + λHᵀLH) k = γ HᵀH k.    (11)

Let γ_i be the ith smallest eigenvalue of (11) and k_i be the corresponding eigenvector. Then, the solution of the output weights β is given by

    β = [k̃_2, k̃_3, ..., k̃_{m+1}],    (12)

where k̃_i = k_i/‖Hk_i‖, i = 2, ..., m + 1, are the normalized eigenvectors.
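The UELM embedding step can be sketched as follows, again as an illustration under assumed hyperparameters (L, λ, σ): build the random hidden layer, form the graph Laplacian, solve the generalized eigenproblem of (11), and map the data by Hβ; the resulting embedding is then fed to k-means.

```python
import numpy as np
from scipy.linalg import eigh

def uelm_embed(X, n_clusters=2, L=40, lam=0.1, sigma=1.0, rng=0):
    """Unsupervised ELM embedding (sketch). Returns the N x m matrix H @ beta."""
    rng = np.random.default_rng(rng)
    N, d = X.shape
    # Random ELM hidden layer, as in the supervised case
    A = rng.normal(size=(d, L))
    b = rng.normal(size=L)
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))
    # Graph Laplacian Lap = D - W with Gaussian similarities
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2 * sigma ** 2))
    Lap = np.diag(W.sum(1)) - W
    # Generalized eigenproblem (I_L + lam * H^T Lap H) k = gamma * H^T H k
    Am = np.eye(L) + lam * H.T @ Lap @ H
    Bm = H.T @ H + 1e-8 * np.eye(L)      # small ridge for numerical stability
    gamma, V = eigh(Am, Bm)              # eigenvalues in ascending order
    # Discard the first (trivial) eigenvector; normalize the next m
    beta = np.column_stack([V[:, i] / np.linalg.norm(H @ V[:, i])
                            for i in range(1, n_clusters + 1)])
    return H @ beta
```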

Defining CB Outliers
For a given dataset D in a d-dimensional space, a point p is denoted by p = ⟨p[1], p[2], ..., p[d]⟩. The distance between two points p_1 and p_2 is measured by the Euclidean distance

    dis(p_1, p_2) = (Σ_{i=1}^{d} (p_1[i] − p_2[i])²)^{1/2}.    (13)

Suppose that there are t clusters C_1, C_2, ..., C_t in D output by UELM. For each cluster C_i, the centroid point C_i.centr can be computed by the following equation:

    C_i.centr[j] = (1/|C_i|) Σ_{p∈C_i} p[j],  j = 1, ..., d.    (14)

Intuitively, in a cluster C, most of the normal points are closely around the centroid point of C. In contrast, an abnormal point p (i.e., an outlier) is usually far from the centroid point, and the number of points close to p is quite small. Based on this observation, the weight of a point is defined as follows.
Definition 1 (weight of a point). Given an integer k, for a point p in cluster C, one uses N_k(p) to denote the set of the k-nearest neighbors of p in C. Then, the weight of p is

    w(p) = k ⋅ dis(C.centr, p) / Σ_{q∈N_k(p)} dis(C.centr, q).    (15)

Definition 2 (result set of CB outlier detection). For a dataset D, given two integers k and n, let D_CB be a subset of D with n points. If, ∀p ∈ D_CB, there is no point q ∈ D \ D_CB such that w(q) > w(p), then D_CB is the result set of CB outlier detection.
For example, consider cluster C2 in Figure 1, whose centroid point is marked in red. For k = 2, the k-nearest neighbors of p2 are p4 and p5. Because p2 is an abnormal point far from the centroid point, dis(C2.centr, p2) is much larger than dis(C2.centr, p4) and dis(C2.centr, p5). Hence, the weight of p2 is large. In contrast, for a normal point p3 deep in the cluster, the distances from C2.centr to its k-nearest neighbors are similar to dis(C2.centr, p3), so the weight of p3 is close to 1. Therefore, p2 is more likely to be considered a CB outlier.
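The weight of Definition 1 can be sketched directly from its description: it is the centroid distance of p relative to the average centroid distance of p's k-nearest neighbors, so it is close to 1 for points deep in the cluster and large for points far from the centroid. The `dis` and `centroid` helpers below are straightforward stand-ins for (13) and (14).

```python
import math

def dis(p, q):
    return math.dist(p, q)

def centroid(cluster):
    d = len(cluster[0])
    return tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(d))

def weight(p, cluster, k):
    """Weight of p in its cluster, following Definition 1 as described above."""
    c = centroid(cluster)
    # k-nearest neighbors of p within its own cluster (brute force here)
    knn = sorted((q for q in cluster if q is not p), key=lambda q: dis(p, q))[:k]
    return k * dis(c, p) / sum(dis(c, q) for q in knn)
```

For a tight cluster with one far-away point, the far point's weight is well above 1 while interior points score near 1, matching the example discussed for Figure 1.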

Framework of Distributed CB Outlier Detection
The target of this paper is to detect CB outliers in a distributed environment, which consists of a master node and a number of slave nodes. The master node is the coordinator, and it can communicate with all the slave nodes. Each slave node holds a subset of the clusters in D output by UELM, and the slave nodes are the main workers in the outlier detection.
Figure 2 shows the framework of the distributed algorithm for CB outlier detection (DACB) proposed in this paper. When a request for outlier detection arrives, each slave node starts to scan the local clusters. Basically, in a cluster C, we need to search the kNNs of each point p in C to compute the weight of p. Obviously, computing the kNNs of all the points is very time-consuming. Therefore, in DACB, we propose a filtering method to accelerate the computation. Specifically, each slave node keeps tracking the local top-n points with the largest weights among the scanned points and sends them to the master node. From the received points, the master reserves the global top-n points with the largest weights and chooses the smallest of these n weights as a threshold θ. The master broadcasts θ to the slave nodes to efficiently filter unpromising points (the detailed filtering method is described in Section 5.1). On each slave node, whenever a point with weight larger than θ emerges, the point is sent to the master. The master periodically updates the global top-n points and the threshold θ. At last, the n points stored on the master node are the CB outliers.
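The master's bookkeeping reduces to maintaining the global top-n weights in a min-heap, with θ as the heap's minimum. The sketch below illustrates this under the assumption that each report carries a weight and a hypothetical point identifier; the class and method names are ours, not the paper's.

```python
import heapq

class Master:
    """Keep the global top-n reported weights; theta is the smallest of them."""
    def __init__(self, n):
        self.n = n
        self.top = []                       # min-heap of (weight, point_id)

    def report(self, w, point_id):
        """Called when a slave reports a point whose weight may enter the top-n."""
        if len(self.top) < self.n:
            heapq.heappush(self.top, (w, point_id))
        elif w > self.top[0][0]:
            heapq.heapreplace(self.top, (w, point_id))

    def theta(self):
        """Threshold broadcast to slaves; 0 until n points have been collected."""
        return self.top[0][0] if len(self.top) == self.n else 0.0
```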

The Computation of Point Weight.
In this section, we introduce the techniques to compute the point weights on each slave node. According to Definitions 1 and 2, to determine whether a point p in a cluster C is an outlier, we need to search the k-nearest neighbors (kNNs) of p in C. In order to accelerate the kNN search, we design an efficient method to prune the search space. For a cluster C, suppose that the points in C have been sorted according to their distances to the centroid point in ascending order. For a point p in C, we scan the points to search for its kNNs. Let N_temp(p) be the set of the k points nearest to p among the scanned points, and let dis_temp(p) be the maximum distance from the points in N_temp(p) to p. Then, the pruning method is described as follows.
Theorem 3. For a point q in front of p, if dis(q, C.centr) < dis(p, C.centr) − dis_temp(p), then the points in front of q and q itself cannot be the kNNs of p.
Furthermore, after the weights of some points have been computed, we send the current top-n points with the largest weights to the master node to obtain the threshold θ (mentioned in Section 4). We can use θ to filter the points that cannot be CB outliers, instead of searching for their exact kNNs. The detailed method is stated in Theorem 4 and Corollary 5.

Proof (of Theorem 4). If o_j and o'_j are the same point, the theorem can be proven easily. Otherwise, dis(p, o_j) < dis(p, o'_j). Using the triangle inequality, dis(o_j, C.centr) ≥ dis(p, C.centr) − dis(p, o_j) > dis(p, C.centr) − dis(p, o'_j). The theorem is proven.
Proof (of Corollary 5). If (17) holds, the weight upper bound of p is smaller than or equal to θ according to Theorem 4. Therefore, p is not a CB outlier.
Algorithm 2 shows the process of CB outlier detection on each slave node. For each cluster C_j in S_i, the points in C_j are sorted according to their distances to the centroid point in ascending order (line (2)). Then, we scan the points in reversed order (line (3)), since the points far from the centroid point are more likely to have large weights, and a good threshold θ can be obtained early. For each scanned point p, we visit the points from p toward both sides to search for the kNNs (line (8)). If a visited point q meets Theorem 3, a number of points can be erased from the visit list since they cannot be the kNNs of p (lines (10)-(13)). If the current kNNs of p meet Corollary 5, p is not an outlier, so we do not search its kNNs further (lines (16)-(18)). After all points are visited, we send p to the master node if p has still not been filtered out (lines (19) and (20)). At last, the n points on the master node are the CB outliers.
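The core of the slave-side scan is the pruned kNN search: visit points outward from p in the centroid-distance order and stop once the centroid-distance gap exceeds the distance to the current kth candidate, since by the triangle inequality no further point can be closer. The sketch below implements that pruning idea (Theorem 3); variable names are ours.

```python
import heapq
import math

def knn_pruned(i, pts, cdists, k):
    """kNN of pts[i], where pts is sorted by distance to the cluster centroid
    and cdists[j] = dis(pts[j], centroid). Returns the k smallest distances.
    Pruning: |cdists[j] - cdists[i]| is a lower bound on dis(pts[i], pts[j]),
    so once it exceeds the current kth-candidate distance we can stop."""
    p = pts[i]
    heap = []                      # max-heap of candidate distances (negated)
    dis_temp = float('inf')        # distance to the current kth candidate
    lo, hi = i - 1, i + 1
    while lo >= 0 or hi < len(pts):
        # visit next the side whose centroid distance is closer to p's
        go_lo = hi >= len(pts) or (lo >= 0 and
                                   cdists[i] - cdists[lo] <= cdists[hi] - cdists[i])
        j = lo if go_lo else hi
        if abs(cdists[i] - cdists[j]) > dis_temp:
            break                  # this point and all remaining ones are prunable
        d = math.dist(p, pts[j])
        if len(heap) < k:
            heapq.heappush(heap, -d)
        elif d < -heap[0]:
            heapq.heapreplace(heap, -d)
        if len(heap) == k:
            dis_temp = -heap[0]
        if go_lo:
            lo -= 1
        else:
            hi += 1
    return sorted(-x for x in heap)
```

With the kNN distances in hand, the weight of p follows from Definition 1, and a point whose weight upper bound falls below θ is skipped entirely.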

The Order of Cluster Scanning.
As Section 4 describes, each slave node visits the local clusters one by one and computes the weights of the points in each visited cluster. Meanwhile, the global top-n points with the largest weights so far are kept in order to obtain the threshold θ. Clearly, the filtering effectiveness can be significantly improved if we obtain a large θ early, whereas a large θ is unlikely to be obtained if we visit the clusters in a random order, because we cannot guarantee that the early scanned points have large weights. As a consequence, we design a new ranking method for the clusters. Figure 5 illustrates the main idea of the ranking method.
Input: the cluster set S_i reserved on slave node i; integers k, n; the threshold θ
Output: the CB outliers in D
(1) for each cluster C_j in S_i do
(2)   Sort the points in C_j according to the distances to the centroid point in ascending order;
(3)   Scan the points in reversed order;
(4)   for each scanned point p do
(5)     Initialize a heap H; // to reserve the current kNNs of p
(6)     dis_temp = ∞; // the largest distance from the points in H to p
(7)     boolean is_cand = true;
(8)     Visit the points from p toward both sides to search p's kNNs;
(9)     for each visited point q do
(10)      if q is before p and dis(q, C_j.centr) < dis(p, C_j.centr) − dis_temp then
(11)        Erase the points before q from the visit list;
(12)      else if q is behind p and dis(q, C_j.centr) > dis(p, C_j.centr) + dis_temp then
(13)        Erase the points behind q from the visit list;
(14)      else
(15)        Update H and dis_temp;
(16)      if p meets the condition of Corollary 5 then
(17)        is_cand = false;
(18)        break;
(19)    if is_cand then
(20)      Send p to the master node to update θ;

Algorithm 2: CB outlier detection on each slave node.

A common idea is that points with large weights possibly emerge in a sparse cluster (one in which the distance between every two points is large). However, this idea does not work well in many cases. For example, in Figure 5(a), the distances from the points to the centroid point in the sparse cluster are almost identical, and thus the weights of these points are not large. In contrast, in the dense cluster in Figure 5(b), although most of the points are very close, two abnormal points q1 and q2 are far from the centroid point; their weights are large and suitable for the θ computation. From the observation in Figure 5(b), we can see that the centroid distances of the points with large weights are large and quite different from those of the other points. Besides, note that we only need n large weights to compute θ. Based on the description above, we propose the following ranking method.
Definition 6 (the outlier factor of a cluster). For a cluster C, suppose that the points have been sorted according to the distance to C.centr in descending order. One uses dis_0 to denote the distance between the first point and C.centr and dis_k to denote the distance between the (k + 1)th point and C.centr. Then, the outlier factor of C is

    OF(C) = dis_k / dis_0.    (18)

For example, in Figure 5(a), k = 2, which means we consider at most 2 points in a cluster to obtain θ. Thus, we sort the points according to the distance to C1.centr in descending order, and we obtain the top-3 points, p1, p2, and p3. Then, we compute the distance between the first point p1 and C1.centr, dis(p1, C1.centr), and the distance between the third point p3 and C1.centr, dis(p3, C1.centr). The outlier factor of C1 is OF(C1) = dis(p3, C1.centr)/dis(p1, C1.centr) = 0.95. Similarly, for the cluster in Figure 5(b) we get OF(C2) = dis(p3, C2.centr)/dis(p1, C2.centr) = 0.5. Comparing the two outlier factors, we can see that OF(C1) is large, which means many points are far from C1.centr and they have similar distances to C1.centr. Therefore, we can hardly know whether there are points with large weights in C1. Conversely, the value of OF(C2) is small. We can assert that, in C2, most of the points closely surround the centroid point and there is at least one abnormal point (e.g., q1, q2) far from C2.centr. The weights of these abnormal points are large, and they contribute to selecting a large θ.
As a consequence, we preferentially visit the clusters with small outlier factors instead of visiting them in a random order. This helps us obtain a large θ early and improves the filtering effectiveness.
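The ranking method above can be sketched as follows: compute OF(C) per Definition 6 and sort clusters by it in ascending order, so that tight clusters with a few far-away points (likely large weights) are scanned first. The helper names are ours.

```python
import math

def centroid(cluster):
    n, d = len(cluster), len(cluster[0])
    return tuple(sum(p[i] for p in cluster) / n for i in range(d))

def outlier_factor(cluster, k):
    """OF(C): the (k+1)th largest centroid distance over the largest one.
    Small OF => a tight cluster with a few far points, i.e., good candidates
    for obtaining a large threshold early."""
    c = centroid(cluster)
    d = sorted((math.dist(c, p) for p in cluster), reverse=True)
    return d[k] / d[0]

def scan_order(clusters, k):
    """Visit clusters with small outlier factors first."""
    return sorted(clusters, key=lambda C: outlier_factor(C, k))
```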

Experimental Evaluation
6.1. Method to Define Distance between Points. We use the method in [4] to define the distance between points in practical applications. First, we use a point with several measurable attributes to represent an object in the real world. All the objects can thus be mapped into Euclidean space. Then, for a point p1 representing object o1 and a point p2 representing object o2, we use the distance between p1 and p2 to measure the difference between o1 and o2. Thus, a point with large distances to the others is more likely to be an outlier.

Result Analysis for the Centralized Algorithm.
In this section, we first evaluate the performance of the proposed algorithm in a centralized environment using a PC with an Intel Core i7-2600 @3.4 GHz CPU, 8 GB main memory, and a 1 TB hard disk. A synthetic dataset is used for the experiments. In detail, given the data size N, we generate N/1000 − 1 clusters and randomly assign each of them a center point and a radius. On average, each cluster has 1000 points following a Gaussian distribution. The remaining 1000 points are scattered into the space. We implement the proposed method to detect CB outliers (DACB) in the Java programming language. A naive method (Naive) is also implemented as a comparison algorithm, in which we simply search each point's kNNs and compute its weight. In the experiments, we mainly report the runtime, representing the computational efficiency, and the point accessing times (PATs), indicating the disk IO cost. The parameters' default values and their variations are shown in Table 1.
As Figure 6(a) shows, DACB is much more efficient than the naive method because of the pruning and filtering strategies proposed in this paper. With the increase of k, we need to keep track of more neighbors for each point, so the runtimes of the naive method and DACB become larger. Figure 6(b) shows the effect of k on the PATs. For the naive method, each point needs to visit all the other points in the cluster to find its kNNs. Hence, the PATs are large. In contrast, for DACB, a point does not have to visit all the other points (Theorem 3), and a large number of points can be filtered out instead of having their exact kNNs found (Corollary 5). Therefore, the PATs are much smaller.
Figure 7 describes the effect of n. As n increases, more outliers are reported. Thus, the runtimes of the naive method and DACB become larger. The effect on the PATs is shown in Figure 7(b), whose trend is similar to that in Figure 6(b). Note that the PATs of DACB increase slightly with n, whereas the PATs of the naive method stay unchanged.
In Figure 8, with the increase of the dimensionality, a number of operations (e.g., computing the distance between two points) become more time-consuming. Hence, the time costs of the two methods become larger. But the variation of the dimensionality does not affect the PATs. The effect of the data size is described in Figure 9. Clearly, with the increase of the data size, we need to scan more points to find the outliers. Therefore, both the runtime and the PATs are linear in the data size.

Result Analysis for the Distributed Algorithm.
In this section, we further evaluate the performance of the proposed algorithm for distributed outlier detection. In the experiments, we mainly consider the time cost and the network transmission quantity (NTQ). The data sizes and the cluster scales are shown in Table 2. The other parameter settings are identical to those described in Section 6.2.
Figure 10 shows the effect of parameter k. The curve "with SOO" represents the DACB algorithm. The curve "without SOO" represents a basic distributed algorithm for outlier detection without the Scanning Order Optimization (SOO) method described in Section 5.2. As the value of k increases, both algorithms cost more time (the reason has been discussed in Section 6.2). But θ is not sensitive to k, and thus the NTQ changes very slightly. Comparing the two algorithms, we can see that, with the help of SOO, a large θ can be obtained early and many more points can be filtered out efficiently. As a result, we can reduce the network overhead and improve the computing efficiency.
In Figure 11, we evaluate the effect of parameter n. With the increase of n, we consider more points as outliers. Thus, the threshold θ becomes smaller and the filtering performance decreases. On the other hand, since a large n means that more points will be transmitted to the master node to compute θ, the NTQ also increases. Figure 12(a) evaluates the effect of the dimensionality on the time cost, which shows the same trend as the result in Figure 8(a). In Figure 12(b), we test the effect of the dimensionality on the NTQ. Each slave node needs to send the local top-n points with the largest weights to the master node. The transmission contents include the points' IDs, the values of all the dimensions, and the weights. Therefore, higher dimensionality leads to higher transmission quantity.
As Figure 13(a) shows, with the increase of the data size, more points need to be scanned to find the outliers; thus, the time cost becomes larger. We can also see in Figure 13(b) that the NTQ increases with the data size, but the change is very small. This is because the value of θ becomes stable after a certain amount of computation, and it is not sensitive to the data size.
The effect of the cluster scale is tested in Figure 14. As more slave nodes are used, the workload on each node becomes smaller, and thus the computing speed improves. However, to compute θ, the master node needs to collect n points from each slave node. Therefore, the NTQ increases with the cluster scale. Note that, for the dataset with 10^7 points, the NTQ is still maintained at the KB level. The network overhead of DACB is acceptable.

Related Work
Outlier detection is an important task in the area of data management, whose target is to find the abnormal objects in a given dataset. The statistics community [2,3] proposed model-based outliers. The dataset is assumed to follow a distribution, and an outlier is an object that shows obvious deviation from the assumed distribution. Later, the data management community pointed out that building a reasonable distribution is almost an impossible task for high-dimensional datasets. To overcome this weakness, they proposed several model-free approaches [22], including distance-based outliers [4][5][6] and density-based outliers [7].
A number of studies focus on developing efficient methods to detect outliers. Knorr and Ng [4] proposed the well-known nested-loop (NL) algorithm to compute distance-based outliers. Bay and Schwabacher [23] proposed an improved nested-loop approach, called ORCA, which efficiently prunes the search space by randomizing the dataset before outlier detection. Angiulli and Fassetti [24] proposed DOLPHIN, which can reduce the disk IO cost by maintaining a small subset of the input data in main memory. Several researchers adopt spatial indexes to further improve the computing efficiency; examples include the R-tree [25], the M-tree [26], and grids. However, the performance of these methods is quite sensitive to the dimensionality. To improve the computing efficiency, some researchers attempt to use distributed or parallel methods to detect outliers; examples include [27][28][29].

Conclusion
In this paper, we studied the problem of CB outlier detection in a distributed environment and proposed an efficient algorithm called DACB. The algorithm adopts a master-slave architecture. The master node monitors the points with large weights on each slave node and computes a threshold. On each slave node, we designed a pruning method to speed up the kNN search and a filtering method that uses the threshold to filter out a large number of unpromising points. We also designed an optimization method for cluster scanning, which can significantly improve the filtering performance of the threshold. Finally, we evaluated the performance of the proposed approaches through a series of simulation experiments. The experimental results show that our method can effectively reduce the runtime and the network transmission quantity of distributed CB outlier detection.

Theorem 4. In the kNN search of p in a cluster C, N_temp(p) is the set of the current k-nearest neighbors of p among the scanned points, and N_k(p) is the set of the exact k-nearest neighbors. One sorts the points in N_k(p) and N_temp(p) according to their distances to p in ascending order, respectively. Then, for the jth point o_j in N_k(p) and the jth point o'_j in N_temp(p), one asserts that dis(o_j, C.centr) ≥ dis(p, C.centr) − dis(p, o'_j).

Figure 5: The ranking method of clusters.

Figure 6: The effect of k for the centralized algorithm.

Figure 7: The effect of n for the centralized algorithm.
Figure 8: The effect of the dimensionality for the centralized algorithm.

Figure 9: The effect of data size for the centralized algorithm.
Figure 10: The effect of k for the distributed algorithm.

Figure 11: The effect of n for the distributed algorithm.
Figure 12: The effect of the dimensionality for the distributed algorithm.

Figure 13: The effect of data size for the distributed algorithm.
Figure 14: The effect of the cluster scale for the distributed algorithm.

Table 1: Parameter settings for the centralized algorithm.

Table 2: Parameter settings for the distributed algorithm.