Outlier detection is an important data mining task whose goal is to find abnormal or atypical objects in a given dataset. Outlier detection techniques have many applications, such as credit card fraud detection and environment monitoring. Our previous work proposed the Cluster-Based (CB) outlier and gave a centralized method using unsupervised extreme learning machines to compute CB outliers. In this paper, we propose a new distributed algorithm for CB outlier detection (DACB). On the master node, we collect a small number of points from the slave nodes to obtain a threshold. On each slave node, we design a new filtering method that uses the threshold to speed up the computation. Furthermore, we also propose a ranking method to optimize the order of cluster scanning. Finally, the effectiveness and efficiency of the proposed approaches are verified through extensive simulation experiments.
Outlier detection is an important problem in data mining, and it has been widely studied for many years. According to the description in [
Outlier detection involves two primary tasks. First, we need to define which data objects in a given set are considered outliers. Second, an efficient method for computing these outliers needs to be designed. The outlier problem was first studied by the statistics community [
As Figure
Example of outliers.
To detect CB outliers in a given set, the data need to be clustered first. In this paper, we employ the unsupervised extreme learning machine (UELM) [
In [
We propose a framework of distributed CB outlier detection, which adopts a master-slave architecture. The master node keeps monitoring the points with large weights on each slave node and obtains a threshold
We propose a new algorithm to compute the point weights on each slave node. Compared with our previous method in [
We propose a new method to optimize the order of cluster scanning on each slave node. With the help of this method, we can obtain a large
The rest of the paper is organized as follows. Section
The target of ELM is to train a single-hidden-layer feedforward network from a training set with
The training process of ELM includes two stages. In the first stage, we build the hidden layer with
Then, the matrix outputted by the hidden layer is
In the second stage, an
We now know the matrices
If
If
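The two-stage training procedure described above can be sketched in a few lines of NumPy (a minimal illustration; the hidden-layer size, the sigmoid activation, and the function names are our assumptions, not the paper's):

```python
import numpy as np

def elm_train(X, T, n_hidden=50, seed=0):
    """Stage 1: random hidden layer; stage 2: least-squares output weights."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))  # random input weights (never tuned)
    b = rng.normal(size=n_hidden)                # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))       # hidden-layer output matrix
    beta = np.linalg.pinv(H) @ T                 # Moore-Penrose pseudoinverse solution
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta
```

Because the hidden parameters are drawn randomly and never adjusted, training reduces to a single linear solve, which is what makes ELM fast.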
Huang et al. [
Since it is difficult to calculate the conditional probabilities, the following can approximate (
In the unsupervised learning, the dataset
Let
If
Again, let
[Equations and the UELM algorithm listing are garbled in extraction. The surviving fragments indicate that the algorithm computes the embedding, partitions it into clusters using the k-means algorithm, and outputs the vector of cluster indices for all the points.]
For a given dataset
Intuitively, in a cluster
Given an integer
For a dataset
For example, in Figure
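Since the definitions above are truncated by extraction, the sketch below assumes a common form of the CB-outlier weight: the weight of a point p is the sum of the distances from p to its k nearest neighbors within p's own cluster, and the n points with the largest weights are reported as outliers (the function names and exact formula are our assumptions):

```python
import numpy as np

def cb_weights(points, labels, k):
    """Assumed CB weight: sum of distances to the k nearest neighbors
    that lie in the same cluster as the point."""
    pts = np.asarray(points, float)
    labels = np.asarray(labels)
    w = np.zeros(len(pts))
    for i, p in enumerate(pts):
        same = pts[labels == labels[i]]
        d = np.sort(np.linalg.norm(same - p, axis=1))[1:k + 1]  # drop self-distance
        w[i] = d.sum()
    return w

def top_n_outliers(points, labels, k, n):
    """Indices of the n points with the largest weights."""
    return np.argsort(-cb_weights(points, labels, k))[:n]
```

A point that is far from the rest of its own cluster receives a large weight, matching the intuition that CB outliers deviate from their cluster.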
The target of this paper is to detect CB outliers in a distributed environment that consists of a master node and a number of slave nodes. The master node is the coordinator, and it can communicate with all the slave nodes. Each slave node stores a subset of the clusters in
The framework of DACB.
When a request of outlier detection arrives, each slave node starts to scan the local clusters. Basically, in a cluster
In this section, we introduce the techniques to compute the point weights on each slave node. According to Definitions
For a cluster
For a point
For a point
Similarly, for a point
Example of
Furthermore, after the weights of some points have been computed, we send the current top
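The master-side bookkeeping can be sketched as follows: each slave reports its current top-n weights, and the master takes the n-th largest of the merged values as the global threshold. This helper is hypothetical, since the paper's exact protocol is truncated here:

```python
def master_threshold(top_lists, n):
    """Merge each slave's reported top-n weights; the n-th largest value is the
    global cut-off below which a point cannot be a top-n CB outlier."""
    merged = sorted((w for lst in top_lists for w in lst), reverse=True)
    return merged[n - 1] if len(merged) >= n else 0.0
```

The threshold can only grow as larger weights are reported, so broadcasting it back lets the slaves prune ever more aggressively.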
In the
If
For a point
If (
By utilizing Corollary
Example of filtering.
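The filtering idea can be illustrated with the following sketch, which assumes (since the corollary's statement is truncated) the standard upper-bound argument: the running sum of the k smallest distances seen so far can only shrink as more points are scanned, so once it falls below the threshold the point can be discarded early:

```python
import numpy as np

def weight_with_pruning(p, cluster, k, threshold):
    """Compute the k-NN distance sum of p within its cluster, stopping early
    (returning None) once p provably cannot be a top-n outlier."""
    knn = []                          # k smallest distances seen so far
    for q in cluster:
        d = float(np.linalg.norm(p - q))
        if d == 0.0:                  # skip p itself (also skips exact duplicates)
            continue
        if len(knn) < k:
            knn.append(d); knn.sort()
        elif d < knn[-1]:
            knn[-1] = d; knn.sort()
        if len(knn) == k and sum(knn) < threshold:
            return None               # pruned: the sum can only shrink further
    return sum(knn)
```

Points deep inside dense clusters are pruned after only a few distance computations, which is where the speed-up comes from.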
Algorithm
[Algorithm listing garbled in extraction; only the fragment "in ascending order" survives, indicating that candidate distances are sorted in ascending order before the threshold-based filtering is applied.]
As Section
The ranking method of clusters.
A sparse cluster
A dense cluster
A common intuition is that points with large weights are likely to emerge in a sparse cluster (one in which the distance between every two points is large). However, this intuition does not work well in many cases. For example, in Figure
For a cluster
For example, in Figure
As a consequence, we preferentially visit the clusters with small outlier factors instead of visiting them in a random order. This method helps us to obtain a large
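A sketch of the ranking idea: each cluster is scored by sampling a few points and averaging their within-cluster k-NN distance sums, and the clusters expected to yield large weights are scanned first so that the threshold grows quickly. The scoring function below is an assumed stand-in for the paper's outlier factor, whose exact definition is truncated here:

```python
import numpy as np

def rank_clusters(clusters, k, sample=3, seed=0):
    """Order cluster indices so that clusters likely to contain large-weight
    points are scanned first (assumed scoring: mean sampled k-NN distance sum)."""
    rng = np.random.default_rng(seed)
    scores = []
    for c in clusters:
        c = np.asarray(c, float)
        picks = c[rng.choice(len(c), min(sample, len(c)), replace=False)]
        s = 0.0
        for p in picks:
            d = np.sort(np.linalg.norm(c - p, axis=1))[1:k + 1]
            s += d.sum()
        scores.append(s / len(picks))
    return np.argsort(scores)[::-1]   # largest expected weights first
```

Scanning high-scoring clusters first raises the global threshold early, so later clusters are pruned more effectively.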
We use the method in [
In this section, we first evaluate the performance of the proposed algorithm in a centralized environment using a PC with an Intel Core i7-2600 @ 3.4 GHz CPU, 8 GB main memory, and a 1 TB hard disk. A synthetic dataset is used for the experiments. In detail, given the data size
Parameter settings for the centralized algorithm.
Parameter  Default  Range of variation
(parameter symbol lost in extraction)  20  15–35
(parameter symbol lost in extraction)  30  20–40
Data size  1  0.5–2.5
Dimensionality  3  3–15
As Figure
The effect of
Time cost versus
Point accessing times versus
Figure
The effect of
Time cost versus
Point accessing times versus
In Figure
The effect of dimensionality for the centralized algorithm.
Time cost versus
Point accessing times versus
The effect of data size for the centralized algorithm.
Time cost versus
Point accessing times versus
In this section, we further evaluate the performance of the proposed algorithm for distributed outlier detection. In the experiments, we mainly consider the time cost and the network transmission quantity (NTQ). The data size and the cluster scale are shown in Table
Parameter settings for the distributed algorithm.
Parameter  Default  Range of variation
Data size  1  0.5–2.5
Number of slave nodes  10  6–14
Figure
The effect of
Time cost versus
Network transmission quantity versus
In Figure
The effect of
Time cost versus
Network transmission quantity versus
Figure
The effect of dimensionality for the distributed algorithm.
Time cost versus dimensionality
Network transmission quantity versus dimensionality
As Figure
The effect of data size for the distributed algorithm.
Time cost versus data size
Network transmission quantity versus data size
The effect of the cluster scale is tested in Figure
The effect of number of slave nodes for the distributed algorithm.
Time cost versus number of slave nodes
Network transmission quantity versus number of slave nodes
Outlier detection is an important task in the area of data management, whose goal is to find abnormal objects in a given dataset. The statistics community [
A number of studies focus on developing efficient methods to detect outliers. Knorr and Ng [
In this paper, we studied the problem of CB outlier detection in a distributed environment and proposed an efficient algorithm called DACB. This algorithm adopts a master-slave architecture. The master node monitors the points with large weights on each slave node and computes a threshold. On each slave node, we designed a pruning method to speed up the
The authors declare that there are no conflicts of interest regarding the publication of this paper.
This work is supported by the National Natural Science Foundation of China under Grants nos. 61602076 and 61371090, the Natural Science Foundation of Liaoning Province under Grant no. 201602094, and the Fundamental Research Funds for the Central Universities under Grant no. 3132016030.