Data streams are continuously generated over time by Internet of Things (IoT) devices. The faster this data is analyzed, its hidden trends and patterns are discovered, and new strategies are created, the faster action can be taken, creating greater value for organizations. Density-based methods are a prominent class of data stream clustering algorithms. They can detect arbitrarily shaped clusters, handle outliers, and do not need the number of clusters in advance. Density-based clustering is therefore a suitable choice for IoT streams. Several density-based algorithms have recently been proposed for clustering data streams; however, density-based clustering in limited time remains a challenging issue. In this paper, we propose a density-based clustering algorithm for IoT streams. Its processing time is fast enough to be applicable in real-time IoT applications. Experimental results show that the proposed approach obtains high quality results with low computation time on real and synthetic datasets.

Because data collection in the Internet of Things (IoT) is based on RFID and conventional sensors, the volume of collected data is extremely large. In many cases, communications and data transfers between the objects are required to enable smart analytics. Such communications and transfers consume both bandwidth and energy, which are usually limited resources in real scenarios. Furthermore, the analytics required for such applications is often real-time, and therefore it requires the design of methods which can provide real-time insights [

Multilayer data stream mining model for Internet of Things (adopted from [

Mining data streams is a relatively new area of research in the data mining community. It has become prominent in many applications such as monitoring environmental sensors, social network analysis, real-time detection of anomalies in computer network traffic, and web searches [

Clustering is an important task in mining data streams [

There are different methods for clustering data streams. In clustering methods, data are categorized based on the similarities among objects. The similarity is determined based on distance or density [

In the last few years, many proposals to extend density-based clustering for data stream have been presented [

The density grid-based clustering [

On the other hand, in density-based microclustering [

To mitigate the problem of density microclustering methods, we propose a hybrid density-based method for clustering evolving data streams. Our proposed method combines the advantages of density grid-based and microclustering methods. We refer to our algorithm as HDC-Stream (hybrid density-based clustering for data stream). HDC-Stream has three steps. In the first step, the new data point is either mapped to the grid or merged into an existing minicluster; a minicluster is a concept similar to a microcluster, formed from a grid cell. The second step prunes miniclusters and grids at each pruning time. The last step forms the final clusters from the pruned miniclusters using a modified DBSCAN algorithm.

The main contributions of HDC-Stream are summarized as follows.

Instead of searching the list of outlier microclusters for a suitable one, HDC-Stream maps the new data point into a grid cell, which saves computation time. This reduces the number of comparisons from

In HDC-Stream, instead of forming a new microcluster for a new data point that does not fit any existing microcluster and may be the seed of an outlier, the new data point is mapped to the grid and kept there until the grid density reaches a predefined threshold. At that point, the grid is converted to a minicluster.
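The grid-buffering idea above can be illustrated with a minimal sketch. It assumes fixed-width cells and, for simplicity, measures density as a plain point count rather than a decayed weight; `cell_of`, `OutlierGrid`, `cell_width`, and `density_threshold` are hypothetical names introduced here, not part of HDC-Stream itself.

```python
import math
from collections import defaultdict


def cell_of(point, cell_width):
    """Return the integer grid coordinates of the cell containing `point`."""
    return tuple(math.floor(x / cell_width) for x in point)


class OutlierGrid:
    """Hypothetical outlier buffer that stores points per grid cell."""

    def __init__(self, cell_width, density_threshold):
        self.cell_width = cell_width
        self.density_threshold = density_threshold
        self.cells = defaultdict(list)

    def add(self, point):
        """Map `point` to its cell; return the cell's points once it is dense."""
        key = cell_of(point, self.cell_width)  # direct lookup, no list scan
        self.cells[key].append(point)
        if len(self.cells[key]) >= self.density_threshold:
            # Assumption: a dense cell is removed from the buffer and its
            # points are handed over to seed a new minicluster.
            return self.cells.pop(key)
        return None
```

Calling `add` repeatedly with points from the same region returns `None` until the cell holds `density_threshold` points, at which point the buffered points are returned and the cell is cleared.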

The experimental results also show that HDC-Stream outperforms two well-known existing density microclustering and density grid-based clustering methods in terms of quality and execution time. Furthermore, the experimental results show that HDC-Stream obtains clusters of high quality even when noise is present.

The remainder of this paper is organized as follows: Section

Clustering is an important task in data stream mining. Recently, many clustering algorithms have been developed for data streams. These clustering algorithms can generally be grouped into the following four main categories [

A partitioning-based clustering algorithm tries to find the best partitioning for data points in which intraclass similarity is maximum and interclass similarity is minimum. Two of the well-known extensions of

Density-based clustering algorithms have been developed to discover clusters with arbitrary shapes. They find clusters based on the dense areas in a shape. If two points are close enough and the region around them is dense, then these two data points join and contribute to construction of a cluster. DBSCAN [

Because of the characteristics of data streams, traditional density-based clustering is not applicable. Recently, many density-based clustering algorithms have been extended to data streams. The main idea in these algorithms is to use a density-based method in the clustering process while overcoming the constraints imposed by the nature of data streams. Density-based clustering algorithms fall into two broad groups: density microclustering and density grid-based clustering algorithms. A comprehensive survey on density-based clustering algorithms for data streams is presented in [

DenStream [

The other important category is density grid-based method. D-Stream [

The neighborhood is within a radius of

MinPts is the minimum number of data points around a data point

For each data point in the data stream, we consider a weight which decreases over time. The initial value of data point is 1. The weight of data point

For a grid

According to the work presented in [

The total weight of all the grids in data space

It means that sum of all data points’ weights has an upper bound of

It is defined as an object for which its overall weight of all

At time

At time

Because the overall weight cannot be more than

A

Is a tuple

This threshold is considered for the sparse grids which do not receive any data for long. In fact, these grids do not have any chance to be converted to dense grids and consequently to

We check all MICs’ weights as well as the weights of all grid cells in a time we call it
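The fading weights and periodic weight checks described in this section can be sketched as follows. The sketch assumes an exponential fading function of the form 2^(-lambda * dt), which is common in density-based stream clustering but is an assumption here; the dictionary layout and the names `decayed_weight`, `add_point`, `prune`, and `min_weight` are likewise hypothetical.

```python
def decayed_weight(weight, last_update, now, decay_factor):
    """Fade `weight` from `last_update` to `now` (assumed form: 2 ** (-lambda * dt))."""
    return weight * 2.0 ** (-decay_factor * (now - last_update))


def add_point(summary, now, decay_factor):
    """Decay the stored weight to `now`, then add a new point of initial weight 1."""
    faded = decayed_weight(summary["weight"], summary["last_update"], now, decay_factor)
    summary["weight"] = faded + 1.0
    summary["last_update"] = now


def prune(summaries, now, decay_factor, min_weight):
    """Keep only the grids or miniclusters whose decayed weight is at least `min_weight`."""
    return {
        key: s
        for key, s in summaries.items()
        if decayed_weight(s["weight"], s["last_update"], now, decay_factor) >= min_weight
    }
```

Because weights are decayed lazily from the last update time, a summary only needs to be touched when a point arrives or when a pruning check runs.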

HDC-Stream is a hybrid density-based clustering algorithm for evolving data streams. The overall architecture of HDC-Stream algorithm is outlined in Algorithm

Merging or mapping (MM-Step): the new data point is added to an existing minicluster or mapped to the grid (lines 5–18 of Algorithm

Pruning grids and miniclusters (PGM-Step): the weights of the grid cells and of the miniclusters are checked periodically, at each pruning time. The periods are defined based on the minimum time for a minicluster to be converted to an outlier. The grids and the miniclusters with weights less than a threshold are discarded, and the memory space is released (lines 19–33 of Algorithm

Forming final clusters (FFC-Step): final clusters are formed based on miniclusters which are pruned. Each minicluster is clustered as a virtual point using a modified DBSCAN (lines 34–36 of Algorithm
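The FFC-Step can be illustrated with a minimal sketch that treats each minicluster center as a virtual point. It uses a plain DBSCAN in which the core condition sums minicluster weights inside the neighborhood instead of counting points; this is one plausible variant, not necessarily the modification used in HDC-Stream, and `cluster_miniclusters`, `eps`, and `min_weight` are hypothetical names.

```python
import math


def cluster_miniclusters(centers, weights, eps, min_weight):
    """Return one cluster label per minicluster center; -1 marks noise."""
    n = len(centers)
    labels = [None] * n

    def neighbors(i):
        return [j for j in range(n) if math.dist(centers[i], centers[j]) <= eps]

    def is_core(nbrs):
        # Assumed core condition: total weight in the eps-neighborhood.
        return sum(weights[j] for j in nbrs) >= min_weight

    cluster_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if not is_core(nbrs):
            labels[i] = -1  # provisional noise; may later become a border member
            continue
        labels[i] = cluster_id
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # border member, not expanded further
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_nbrs = neighbors(j)
            if is_core(j_nbrs):
                queue.extend(k for k in j_nbrs if labels[k] is None or labels[k] == -1)
        cluster_id += 1
    return labels
```

Since the number of miniclusters is typically much smaller than the number of raw data points, the quadratic neighbor search in this sketch stays cheap; a spatial index could replace it if needed.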

[Algorithm listing (36 numbered lines) not recoverable from the source. Surviving fragments: line 11, "Update"; lines 16 and 23, "Remove grid"; line 28, "Remove MIC from"; line 35, "Generate clusters using a modified DBSCAN".]

Overall view of HDC-Stream algorithm.

The steps are explained as follows.

When a new data point arrives (Figure

HDC-Stream finds the nearest

If the new data point’s distance to the nearest

Otherwise, the data point has to be mapped into the grid in the outlier buffer.

If the number of data points in grid

If the grid weight

The related grid

MM-Step of HDC-Stream algorithm.
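The merge-or-map decision of the MM-Step can be sketched as follows. The sketch assumes that a point is merged when its distance to the nearest minicluster center is at most a fixed `radius`, and that a minicluster is summarized by a running mean and a count; the exact merge test and summary structure of HDC-Stream may differ. `merge_or_map` and `map_to_grid` are hypothetical names.

```python
import math


def merge_or_map(point, miniclusters, radius, map_to_grid):
    """Merge `point` into the nearest minicluster if close enough, else map it to the grid."""
    nearest = min(
        miniclusters,
        key=lambda mc: math.dist(point, mc["center"]),
        default=None,
    )
    if nearest is not None and math.dist(point, nearest["center"]) <= radius:
        n = nearest["count"]
        # Incrementally update the running mean of the minicluster.
        nearest["center"] = tuple(
            (c * n + x) / (n + 1) for c, x in zip(nearest["center"], point)
        )
        nearest["count"] = n + 1
        return "merged"
    # No suitable minicluster: the point goes to the grid-based outlier buffer.
    map_to_grid(point)
    return "mapped"
```

Only the minicluster list is searched; a point that fits none of them is handed to `map_to_grid`, so no outlier list is scanned.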

For each

When a clustering request arrives, a variant of DBSCAN algorithm is applied on the set of the online maintained miniclusters to get the clustering result. Each minicluster

A

A

A

In this section, we present the evaluation of HDC-Stream against two well-known existing methods, DenStream and D-Stream. We implemented HDC-Stream as well as the comparative methods in Java. All experiments were conducted on a 2.5 GHz machine with 4 GB memory, running Mac OS X. We first describe the datasets and the evaluation measures used for the evaluation of the HDC-Stream algorithm, and then discuss detailed experiments on real and synthetic datasets.

The clustering quality, scalability, and sensitivity of the HDC-Stream algorithm are evaluated on both real and synthetic datasets. We generated three synthetic datasets, DS1, DS2, and DS3, which are depicted in Figures

Synthetic datasets.

Dataset DS1—10000 data points, 3% noise

Dataset DS2—10000 data points, 4% noise

Dataset DS3—10000 data points, 5% noise

The real dataset used is KDD CUP99 Network Intrusion Detection dataset (all 34 continuous attributes out of the total 42 available attributes are used) [

Cluster validity is an important issue in cluster analysis. Its objective is to assess the clustering results of the proposed algorithm by comparing them with those of existing well-known clustering algorithms. In the following, we adopt two popular measures, purity and normalized mutual information (NMI), to evaluate the quality of HDC-Stream.

The clustering quality is evaluated by the average purity of clusters which is defined as follows:
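As a minimal sketch, assuming the usual definition of average purity (the mean, over clusters, of the fraction of points that carry the dominant true label of their cluster), the measure can be computed as shown here. The function name `average_purity` and the handling of inputs are assumptions; in particular, noise points are not treated specially.

```python
from collections import Counter


def average_purity(cluster_ids, true_labels):
    """Mean, over clusters, of the share of points carrying the cluster's dominant label."""
    members = {}
    for cid, label in zip(cluster_ids, true_labels):
        members.setdefault(cid, []).append(label)
    if not members:
        raise ValueError("at least one clustered point is required")
    shares = [
        Counter(labels).most_common(1)[0][1] / len(labels)
        for labels in members.values()
    ]
    return sum(shares) / len(shares)
```

For example, a cluster with three points of class "a" and one of class "b" has purity 0.75, and averaging it with a perfectly pure second cluster gives 0.875.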

The normalized mutual information (NMI) is a well-known information theoretic measure that assesses how similar two clusterings are. Given the true clustering

The parameters of HDC-Stream adopt the following settings: decay factor

Figure

Cluster purity of HDC-Stream for EDS with (a) horizon = 1 and stream speed = 2000 and (b) horizon = 5 and stream speed = 2000.

The same is observed from the normalized mutual information aspect. In fact, Figure

Normalized mutual information of HDC-Stream for EDS with (a) horizon = 1 and stream speed = 2000 and (b) horizon = 5 and stream speed = 2000.

We noted very good clustering quality of HDC-Stream, D-Stream, and DenStream when no noise is present in the dataset. In fact, purity values are always higher than 98% and all methods are insensitive to the horizon length.

The comparison results among HDC-Stream and both DenStream and D-Stream on the Network Intrusion dataset are shown in Figure

Cluster purity of HDC-Stream for Network Intrusion Detection dataset with (a) horizon = 2 and stream speed = 1000 and (b) horizon = 5 and stream speed = 1000.

We show the normalized mutual information results on Network Intrusion Detection dataset in Figure

Normalized mutual information of HDC-Stream on Network Intrusion Detection dataset with (a) horizon = 1 and stream speed = 1000 and (b) horizon = 5 and stream speed = 1000.

The execution time of HDC-Stream is influenced by the number of data points processed at each time unit, that is, the stream speed. Figure

Execution time for increasing stream lengths on Network Intrusion Detection dataset.

DenStream has higher processing time due to its merging task which is time consuming. HDC-Stream has lower execution time compared to the others. The execution time of other methods increases linearly with respect to the stream speed.

Memory usage of HDC-Stream is

An important parameter of HDC-Stream is

Cluster quality versus decay factor.

We proposed a hybrid method for clustering evolving data streams which has high quality and low computation time compared to existing methods. The algorithm clusters data streams in three distinct steps. In existing methods such as DenStream, when a new data point arrives, it takes time to search two lists of microclusters, potentials and outliers, to find a suitable microcluster. If none is found, DenStream forms a new microcluster for that data point, which may be the seed of an outlier, leading to lower clustering quality. HDC-Stream, in contrast, searches only the potential list, and if it cannot find a suitable microcluster, the data point is mapped to the grid, which holds the outlier buffer. We reduced the time complexity of the clustering algorithm using grid-based clustering. The grid-based method allows us to decrease merging time complexity from

We reduced the number of comparisons; therefore, time complexity for merging to minicluster list is

Finally, the evaluation results indicate that using a hybrid method for clustering evolving data streams improves clustering quality and reduces computation time.

In this paper, we proposed a hybrid density-based clustering algorithm for Internet of Things (IoT) streams. Our hybrid algorithm has three steps: the new data point is either mapped to the grid or merged into an existing minicluster, the outliers are removed, and finally arbitrarily shaped clusters are formed from the miniclusters by a modified DBSCAN. The method combines density grid-based clustering and density microclustering to improve computation time and quality. The evaluation results on synthetic and real datasets show that it achieves high quality with low computation time for merging. However, HDC-Stream is not suitable for use in distributed environments.

Our future work will focus on the improvement of HDC-Stream as a distributed density-based data stream clustering algorithm.

The authors declare that there is no conflict of interests regarding the publication of this paper.

This research is supported by High Impact Research (HIR) Grant, University of Malaya, no. UM.C/625/HIR/MOHE/SC/13/2 from Ministry of Higher Education.