With the development of the Internet of Everything, encompassing the Internet of Things, the Internet of People, and the Industrial Internet, big data is being generated at an unprecedented rate. Clustering is a widely used technique for big data analytics and mining. However, most current clustering algorithms are not effective on heterogeneous data, which is prevalent in big data. In this paper, we propose a high-order CFS algorithm (HOCFS) for clustering heterogeneous data by combining the CFS clustering algorithm with the dropout deep learning model. Its functionality rests on three pillars: (i) an adaptive dropout deep learning model that learns features from each type of data, (ii) a feature tensor model that captures the correlations of heterogeneous data, and (iii) a tensor distance-based high-order CFS algorithm that clusters heterogeneous data. Furthermore, we verify the proposed algorithm on different datasets by comparison with two other clustering schemes, HOPCM and CFS. The results confirm the effectiveness of the proposed algorithm in clustering heterogeneous data.
With the rapid development of the Internet of Things, Internet of People, and Industrial Internet, big data analytics and mining have become a hot topic [
Heterogeneous data, different from homogeneous data containing only one type of object, involve multiple interrelated types of objects [
In this paper, we propose a high-order CFS algorithm (HOCFS) for clustering heterogeneous data based on the dropout deep learning model. The dropout deep learning model was proposed by Hinton to prevent overfitting [
Finally, we compare the proposed algorithm with two representative clustering techniques, HOPCM and CFS, on two datasets, NUS-WIDE and CUAVE, in terms of
Therefore, the contributions of this paper are summarized in the following three aspects: (i) Current dropout deep learning models have low effectiveness and efficiency when learning features for heterogeneous data. To tackle this problem, the paper proposes an adaptive dropout deep learning model that learns features for each type of data and then fuses the learned features into a feature tensor for each heterogeneous data object. (ii) To measure the similarity between heterogeneous data objects in the high-order tensor space, the paper applies the tensor distance in the clustering process. (iii) The conventional CFS algorithm cannot cluster heterogeneous data directly because it works in the vector space. The paper extends the CFS algorithm from the vector space to the tensor space for clustering heterogeneous data represented by feature tensors.
This section presents the technical preliminaries of our scheme: the stacked autoencoder, dropout, and the CFS clustering algorithm. The stacked autoencoder and dropout are presented first, followed by the CFS clustering algorithm.
The stacked autoencoder (SAE), an important class of deep learning models, has been widely employed in unsupervised feature learning for many applications [
The architecture of the stacked autoencoder.
As the typical module of a stacked autoencoder, a basic autoencoder (BAE) [
Then, BAE reconstructs the input from the hidden representation
To train the parameters of the autoencoder, an objective function with a weight-decay term, used to prevent overfitting, is defined as follows:
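To make the BAE training objective concrete, the following is a minimal numpy sketch of the forward pass and the reconstruction-plus-weight-decay objective described above. The function names, the sigmoid activation, and the decay coefficient `lam` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bae_forward(x, W1, b1, W2, b2):
    """Encode input x to a hidden representation h, then reconstruct it."""
    h = sigmoid(W1 @ x + b1)       # encoder
    x_hat = sigmoid(W2 @ h + b2)   # decoder (reconstruction of the input)
    return h, x_hat

def bae_objective(X, W1, b1, W2, b2, lam=1e-3):
    """Mean reconstruction error plus a weight-decay (L2) penalty on the
    weights; the penalty discourages overfitting, as noted in the text."""
    recon = 0.0
    for x in X:
        _, x_hat = bae_forward(x, W1, b1, W2, b2)
        recon += 0.5 * np.sum((x_hat - x) ** 2)
    decay = 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    return recon / len(X) + decay
```

A stacked autoencoder would train such modules layer by layer, feeding each layer's hidden representation to the next.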
The stacked autoencoder is a fully connected model and involves many redundant connections, so it tends to overfit in real applications. To address this problem, Hinton proposed dropout, which reduces overfitting by preventing the co-adaptation of feature detectors in deep learning models. It randomly omits half of the feature detectors on each training sample, preventing a hidden unit from relying on other hidden units being present. Dropout has proved especially effective and efficient for training a large neural network on a small training set.
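The random omission of hidden units can be sketched as a binary mask over the activations. The snippet below uses the "inverted dropout" convention (rescaling surviving units at training time), which is a common equivalent of Hinton's original formulation rather than the paper's exact procedure:

```python
import numpy as np

def dropout(h, p_omit=0.5, rng=None):
    """Randomly omit each hidden unit with probability p_omit at training
    time. Surviving activations are rescaled by 1/(1 - p_omit) so that no
    rescaling is needed at test time (inverted-dropout convention)."""
    rng = rng or np.random.default_rng()
    mask = rng.random(h.shape) >= p_omit   # True = unit is kept
    return h * mask / (1.0 - p_omit)
```

With `p_omit=0.5`, roughly half of the units are zeroed on each training sample, matching the behavior described above.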
CFS is a recent clustering algorithm proposed by Rodriguez and Laio in Science in 2014 [
The key to the CFS algorithm lies in its characterization of cluster centers. In particular, the algorithm assumes that cluster centers are surrounded by neighbors with lower local density and are relatively far from any object with a higher local density. Based on this assumption, CFS defines two quantities for every data object
In the CFS algorithm, cluster centers are recognized as the objects with large values of
Consider a dataset with
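The two quantities that CFS computes for each object, the local density and the distance to the nearest object of higher density, can be sketched as follows. The cutoff distance `d_c` and the use of a precomputed distance matrix are illustrative assumptions:

```python
import numpy as np

def cfs_quantities(D, d_c):
    """For each object, compute rho (local density: the number of neighbors
    within cutoff distance d_c) and delta (the distance to the nearest
    object of higher density). D is a precomputed pairwise distance matrix."""
    n = D.shape[0]
    rho = (D < d_c).sum(axis=1) - 1          # exclude the point itself
    delta = np.zeros(n)
    for i in range(n):
        denser = np.where(rho > rho[i])[0]
        if denser.size == 0:                  # the highest-density point
            delta[i] = D[i].max()
        else:
            delta[i] = D[i, denser].min()
    return rho, delta
```

Cluster centers are then the objects for which both `rho` and `delta` are large, for example the points with the largest product `rho * delta`; the remaining points are assigned to the cluster of their nearest denser neighbor.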
In this section, we describe the details of the proposed high-order CFS algorithm, based on the dropout deep learning model, for clustering heterogeneous data. The proposed algorithm works in three stages: unsupervised feature learning, feature fusion, and high-order clustering, as shown in Figure
The architecture of the proposed scheme.
In the first stage, each type of data in the heterogeneous dataset is learned separately by the proposed adaptive dropout deep learning model. In the second stage, the proposed algorithm uses the vector outer product to fuse the learned features into a feature tensor that serves as the joint representation of each object. Finally, the proposed algorithm extends the conventional CFS technique from the vector space to the tensor space for clustering the heterogeneous dataset.
In the standard dropout deep learning model, each hidden unit is omitted from the network with a constant probability of 0.5. This ignores the relationship between the omitting probability and the layer position, resulting in low effectiveness of deep learning models for heterogeneous data feature learning. A large number of studies demonstrate that the lower layers of a deep architecture learn many shared, general characteristics, implying that dropout in the lower layers plays a stronger regularizing role than in the higher layers. Therefore, the omitting probability of dropout should decay as the layers become higher.
Based on the above analysis, we propose an adaptive dropout deep learning model by defining a distribution model of the omitting probability
Function ( ) has three properties: (1) it is monotonically decreasing; (2) the omitting probability is 0.5 for the middle hidden layer; (3) the omitting probability is always in (
(1) By the assumption, function
(2) When
When
(3) Based on property (1),
Then,
Therefore, the omitting probability is always in (
We obtain the adaptive dropout deep learning model by applying the distribution function of the omitting probability to the deep learning model, as outlined in Algorithm
(1) Randomly initialize all parameters. [The remaining steps of the algorithm listing were not recovered.]
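Since the paper's exact distribution function is not recoverable from this excerpt, the following is a hypothetical sigmoid-shaped function that satisfies the three stated properties: it is monotonically decreasing in the layer index, equals 0.5 at the middle hidden layer, and always lies strictly between 0 and 1. The steepness parameter `k` is an assumption:

```python
import math

def omit_probability(l, L, k=1.0):
    """Hypothetical omitting probability for hidden layer l of L (1-indexed).

    Properties matching the requirements in the text:
      * strictly decreasing in l (lower layers omit more units),
      * exactly 0.5 at the middle layer l = (L + 1) / 2,
      * always strictly inside (0, 1).
    """
    mid = (L + 1) / 2.0
    return 1.0 / (1.0 + math.exp(k * (l - mid)))
```

For a model with 5 hidden layers this gives an omission rate above 0.5 for layers 1 and 2, exactly 0.5 for layer 3, and below 0.5 for layers 4 and 5.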
In the proposed high-order CFS algorithm, the adaptive dropout deep learning model is used to learn features of each type of data of the heterogeneous data.
The vector outer product is a widely used operation in mathematics, denoted by
More generally, the outer product of
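The outer-product fusion of modality-specific feature vectors can be illustrated with numpy. The toy vector sizes below are arbitrary:

```python
import numpy as np

# Feature vectors learned separately for each modality (toy sizes).
image_f = np.array([1.0, 2.0])        # e.g. image features
text_f  = np.array([3.0, 4.0, 5.0])   # e.g. text features
video_f = np.array([6.0, 7.0])        # e.g. video features

# Outer product of two vectors -> order-2 feature tensor (a matrix):
# T2[i, j] = image_f[i] * text_f[j].
T2 = np.outer(image_f, text_f)                           # shape (2, 3)

# Outer product of three vectors -> order-3 feature tensor:
# T3[i, j, k] = image_f[i] * text_f[j] * video_f[k].
T3 = np.einsum('i,j,k->ijk', image_f, text_f, video_f)   # shape (2, 3, 2)
```

Every entry of the fused tensor is a product of one feature from each modality, which is how the correlations between modalities are captured in the joint representation.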
After using the adaptive dropout deep learning model to learn features of heterogeneous data, each type of data can be represented by a feature vector. In particular, for a heterogeneous dataset in which each object consists of one image, one text, and one piece of video, three feature vectors are learned, and the feature tensor of each object is formed by the outer product of the feature vectors of its available modalities. For an object with only one image and one text, its feature tensor is the outer product of the image and text feature vectors. For an object with only one image and one piece of video, its feature tensor is the outer product of the image and video feature vectors. For an object with only one text and one piece of video, its feature tensor is the outer product of the text and video feature vectors. For an object with one image, one text, and one piece of video, its feature tensor is the outer product of all three feature vectors.
As discussed in Section
To calculate the distance between two points in high-order tensor space, represented by two tensors,
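A tensor distance couples entries of the flattened tensors through a metric matrix, so that correlations across modes influence the distance, unlike a plain Euclidean distance. The Gaussian form of the metric matrix and the parameter `sigma` below are one common choice, not necessarily the paper's exact definition:

```python
import numpy as np
from itertools import product

def tensor_distance(X, Y, sigma=1.0):
    """Distance between two feature tensors of the same shape.

    G[l, m] decays with the distance between the coordinates of the l-th
    and m-th tensor entries, so nearby entries are coupled; a plain
    Euclidean distance corresponds to G being the identity matrix.
    """
    coords = np.array(list(product(*[range(s) for s in X.shape])),
                      dtype=float)
    diff2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=2)
    G = np.exp(-diff2 / (2.0 * sigma ** 2))   # Gaussian metric matrix
    v = (X - Y).ravel()
    return float(np.sqrt(v @ G @ v))
```

Because the Gaussian kernel matrix is positive semidefinite, the quantity under the square root is never negative, and the distance of a tensor to itself is zero.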
The proposed high-order CFS clustering algorithm (HOCFS) based on the feature tensor is outlined in Algorithm
(9) Select clustering centers according to [only this step of the algorithm listing was recovered.]
In this part, we assess the adaptive dropout deep learning model on the STL-10 and CIFAR-10 datasets by comparison with the conventional dropout model.
We initially explored the effectiveness of adaptive dropout on STL-10, a widely used benchmark for machine learning algorithms. It contains 500 training images and 800 test images per class across 10 classes, plus 100,000 unlabeled images for unsupervised learning. We combine the adaptive dropout distribution model with stacked autoencoders to train two deep learning models, one with 4 hidden layers and the other with 5, each with a logistic regression layer on top. For the adaptive dropout deep learning model, we use the proposed algorithm to set the omission rate of the hidden units, while the conventional dropout deep learning model uses a fixed omission rate of 0.5. The classification results are presented in Figures
Classification result on STL-10 with 4 hidden layers.
Classification result on STL-10 with 5 hidden layers.
From Figures
CIFAR-10 is a benchmark task for object recognition, consisting of 60,000 color images in 10 classes, with 6,000 images per class. The images were labeled by hand to produce 50,000 training images and 10,000 test images. To explore the effectiveness of the adaptive dropout model on CIFAR-10, we built a classification network with three convolutional layers, three pooling layers, and two fully connected layers. Each convolutional layer is followed by its own ReLU layer and dropout layer. As before, the adaptive dropout deep learning model uses the proposed algorithm to set the omission rate of the hidden units, while the conventional dropout deep learning model uses a fixed omission rate of 0.5. The classification results are presented in Figure
Classification result on CIFAR-10.
From Figure
In this part, we evaluate the high-order CFS clustering algorithm by comparison with the HOPCM algorithm and the conventional CFS algorithm on two representative heterogeneous datasets, namely, NUS-WIDE and CUAVE, in terms of
HOPCM was developed in 2015 for clustering heterogeneous data by combining the autoencoder model and the possibilistic
The evaluation criteria are described in Section
The NUS-WIDE dataset is one of the largest publicly available annotated image datasets, consisting of 269,648 annotated images. To compare the proposed algorithm fairly with the HOPCM algorithm and the conventional CFS algorithm, we use the same image subset of NUS-WIDE as the literature [
First, we carried out the experiments on the overall image set five times. The clustering results are shown in Figures
Clustering result on NUS-WIDE in terms of
Clustering result on NUS-WIDE in terms of
Figure
From Figure
Next, we carried out the experiment on the 8 subsets five times each to evaluate the robustness of the clustering algorithms. Tables
Clustering result on NUS-WIDE in terms of
Algorithm/subset | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|
CFS | 2.64 | 3.01 | 2.99 | 3.04 | 2.73 | 3.02 | 3.08 | 2.82 |
HOPCM | 2.04 | 2.57 | 2.91 | 2.63 | 2.12 | 2.91 | 2.99 | 2.08 |
HOCFS | 1.96 | 2.24 | 2.37 | 2.28 | 1.95 | 2.16 | 2.39 | 2.01 |
Clustering result on NUS-WIDE in terms of RI.
Algorithm/subset | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|
CFS | 0.86 | 0.79 | 0.87 | 0.82 | 0.76 | 0.79 | 0.83 | 0.69 |
HOPCM | 0.91 | 0.84 | 0.93 | 0.91 | 0.88 | 0.92 | 0.82 | 0.84 |
HOCFS | 0.95 | 0.84 | 0.94 | 0.95 | 0.93 | 0.96 | 0.89 | 0.91 |
From Tables
CUAVE is a typical multimodal dataset consisting of the digits 0 to 9 spoken by 36 individuals. To assess HOCFS for clustering heterogeneous data, we added annotations to each object as in the literature [
We first carried out the experiment on the CUAVE dataset five times to evaluate HOCFS for clustering heterogeneous data in terms of RI. The result is presented in Figure
Clustering result on CUAVE in terms of
According to Figure
Next, we evaluated the robustness of the proposed algorithm by generating three different subsets, each with a distinct combination of two modalities. We carried out the experiment on these subsets five times each. The results are shown in Figures
Clustering result on image-text subset in terms of
Clustering result on text-audio subset in terms of
Clustering result on image-audio subset in terms of
According to Figures
Finally, we studied the relationship between the clustering result and the different combinations of modalities by analyzing the clustering results, as shown in Table
Clustering result on different subsets in terms of RI.
Algorithm/subset | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
Image-text | 0.92 | 0.88 | 0.87 | 0.89 | 0.93 |
Text-audio | 0.81 | 0.79 | 0.78 | 0.83 | 0.80 |
Image-audio | 0.89 | 0.83 | 0.89 | 0.85 | 0.87 |
Overall | 0.96 | 0.91 | 0.93 | 0.96 | 0.94 |
From Table
In this paper, we proposed a high-order CFS algorithm for clustering heterogeneous data. One contribution of the paper is an adaptive dropout deep learning model, applied to learning features of each type of data. Furthermore, the vector outer product was used to model the correlations among the types of data, forming a feature tensor for every heterogeneous data object. Another contribution of the proposed algorithm is the adoption of the tensor distance to measure the similarity between every two heterogeneous objects. Experimental results showed that our proposed algorithm produces more accurate results than HOPCM and CFS in terms of
Recently, more and more complex heterogeneous data have been generated in many applications; for example, a single web document may simultaneously contain many images and audio pieces. Future work will focus on how to cluster such complex heterogeneous datasets.
The authors declare that they have no competing interests.