A Novel Text Clustering Approach Using Deep-Learning Vocabulary Network

Text clustering is an effective approach to collect and organize text documents into meaningful groups for mining valuable information on the Internet. However, some issues remain to be tackled, such as feature extraction and data dimension reduction. To overcome these problems, we present a novel approach named deep-learning vocabulary network. The vocabulary network is constructed based on related-word set, which contains the “cooccurrence” relations of words or terms. We replace term frequency in feature vectors with the “importance” of words in terms of vocabulary network and PageRank, which can generate more precise feature vectors to represent the meaning of text documents. Furthermore, sparse-group deep belief network is proposed to reduce the dimensionality of feature vectors, and we introduce coverage rate for similarity measure in Single-Pass clustering. To verify the effectiveness of our work, we compare the approach with representative algorithms, and experimental results show that feature vectors in terms of deep-learning vocabulary network have better clustering performance.


Introduction
Webpages, microblogs, and social networks provide much useful information for us, and text clustering is an important text mining method to collect valuable information on the Internet. Text clustering helps us to group an enormous amount of text documents into small meaningful clusters, and it has been used in many research fields such as sentiment analysis (opinion mining) [1][2][3], text classification [4][5][6], text summarization [7], and event tracking and topic detection [8][9][10].
The process of text clustering is usually divided into two phases: preprocessing phase and clustering phase. Before preprocessing phase, some basic steps (including tokenization, stop-word removal, and stemming) are needed to process text documents; these steps split sentences into words and remove useless words or terms.
The first phase is the preprocessing of text, and the second phase is clustering for text documents. The preprocessing phase mainly transforms text documents into structured data that can be processed by clustering algorithms. This phase contains two parts: feature extraction and feature selection.
In the existing literature, there are two categories of feature extraction methods: term frequency-based method and semantic web-based method. Term frequency-based method counts the occurrences of words, and semantic web-based method structures the knowledge of a certain domain into an ontology, which contains words and their relations.
Term-document vectors are extracted from text documents in the process of feature extraction. Most term frequency-based methods employ vector space model (VSM) to represent text documents, and each entry of VSM is the frequency of words or terms. The most representative method based on term frequency is term frequency-inverse document frequency (tf-idf) algorithm. For its simplicity and high efficiency, researchers have proposed many improved tf-idf algorithms [11,12].
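As a minimal sketch of the tf-idf weighting mentioned above, the following Python snippet builds term-document vectors for a toy corpus. The function name `tf_idf`, the toy documents, and the particular variant (relative term frequency multiplied by the natural logarithm of the inverse document frequency) are assumptions made for illustration; the improved variants in [11,12] may differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical toy corpus (already tokenized, stop words removed).
docs = [
    ["text", "clustering", "groups", "text"],
    ["deep", "belief", "network"],
    ["text", "network"],
]
vectors = tf_idf(docs)
```

Note that the word order of each document plays no role in the resulting vectors, which is the limitation discussed in the next paragraph.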
However, the relations of words (or word order) are lost when text documents are transformed into term-document vectors. Many researchers find that the words or terms have lexical "cooccurrence" phenomenon [13], which means some words or terms have a high probability of occurring together in a text document. Researchers think that the "cooccurrence" relations of words or terms can generate more precise feature vectors to represent the meaning of text documents.
The objective of feature selection is to remove redundant information and reduce the dimensionality of term-document vectors. The methods of feature selection are categorized as corpus-based method, Latent Semantic Indexing (LSI), and subspace-based clustering. The corpus-based method merges synonyms together to reduce the dimensionality of features, which depends on large corpora such as WordNet and HowNet. Traditional LSI decomposes a term-document vector into a term-space matrix by singular value decomposition (SVD). Subspace-based clustering groups text documents in a low-dimensional subspace.
In our paper, we propose a novel approach to address two issues: one is the loss of word relations in the process of feature extraction, and the other is to retain the word relations in dimension reduction. Considering that the relations of words and terms are lost in term frequency-based methods, we construct a vocabulary network to retain "cooccurrence" relations of words or terms. Term frequency is replaced with the "importance" of words or terms in VSM. Furthermore, traditional feature selection methods can lose some information that affects the performance of clustering [14], and we introduce deep learning for dimension reduction.
The main contribution of our paper is a novel graph-based approach for text clustering, called deep-learning vocabulary network (DLVN). We employ the edges of vocabulary network to represent the relations between words or terms and extract features of text documents in terms of related-word set. The related-word set is a set of words in the same class, and we utilize association rules learning to obtain relations between words. In addition, high dimensional and sparse features of text have a big influence on clustering algorithms, and we employ deep learning for dimensionality reduction. Accordingly, an improved deep-learning Single-Pass (DL-SP) is used in the process of clustering. To verify the effectiveness of the approach, we provide our experimental evaluation based on Chinese corpora.
The rest of this paper is organized as follows. Section 2 reviews related work in the literature. Section 3 introduces the theoretical foundation related to this paper. Section 4 describes the approach of DLVN we propose. Section 5 is experimental analysis. Section 6 is the conclusion of our work.

Related Work
Text clustering groups text documents of similar content (so-called topic) into a cluster. In this section, we use three subsections to review related literature.

Feature Extraction.
Term frequency-based method is an important method to extract features. In term frequency-based method, text documents are represented as VSM, and each document is transformed into a vector, whose entries are the frequency of words or terms. Most term frequency-based methods are improvements of tf-idf.
Semantic web is to structure knowledge into an ontology. As researchers find that the relations between words contribute to understanding the meaning of text, they construct a semantic network in terms of concepts, events, and their relations. Yue et al. [15] constructed a domain-specific ontology to describe the hazards related to dairy products and translated the term-document vectors (namely, feature vectors of text) into a concept space. Wei et al. [16] exploited an ontology hierarchical structure for word sense disambiguation to assess similarity of words. The experiment results showed better clustering performance for ontology-based methods considering the semantic relations between words. Bing et al. [17] proposed an adaptive concept resolution (ACR) model for the characteristics of text documents, and ACR was an ontology-based method of text representation. However, the efficiency of semantic web analysis is a challenge for researchers, and the large scale of text corpora has a great influence on algorithms [18].
For retaining the relations of words and terms, some researchers proposed to employ graph-based model in text clustering [19,20]. Mousavi et al. [21] proposed a weighted-graph representation of text to extract semantic relations in terms of parse trees of sentences. In our work, we introduce frequent itemsets to construct related-word set, and use each itemset of related-word set to represent the relations between words. Language is always changing, and new words are appearing every day. Related-word set can capture the change of language by mining frequent itemsets.

Feature Selection.
Feature selection is a feature construction method to transform a high dimensional feature space into a low-dimensional feature space. SVD is a representative method using mathematical theory for dimension reduction. Jun et al. [22] combined SVD and principal component analysis (PCA) for dimensionality reduction. Zhu and Allen [23] proposed a latent semantic indexing subspace signature model (LSISSM) based on LSI and transformed term-document vectors into a low-rank approximation for dimensionality reduction. However, LSI selects a new feature subset to construct a semantic space, which loses some important features and suffers from the irrelevant features.
Due to the sparsity and high-dimensionality of text features, the performance of the subspace-based clustering is better than traditional clustering algorithm [24,25]. Moreover, some researchers integrate many related theories for dimensionality reduction. Bharti and Singh [26] proposed a hybrid intelligent algorithm, which integrated binary particle swarm optimization, chaotic map, dynamic inertia weight, and mutation for feature selection.

Clustering Algorithm.
Clustering is an unsupervised approach of machine learning, and it groups similar objects into a cluster. The most representative clustering algorithm is partitional clustering such as k-means and k-medoids [27], and each cluster has a center called centroid in partitional clustering. Mei and Chen [28] proposed a clustering around weighted prototypes (CAWP) based on new cluster representation method, where each cluster was represented by multiple objects with various weights. Tunali et al. [29] improved spherical k-means (SKM) and proposed a multicluster spherical k-means (MCSKM), which allowed documents to be assigned to more than one cluster. Li et al. [30] introduced a concept of neighbor and proposed a parallel k-means based on neighbors (PKBN).
Another representative clustering algorithm is hierarchical clustering, which contains divisive hierarchical clustering and agglomerative hierarchical clustering [31]. Peng and Liu [32] proposed an incremental hierarchical text clustering approach, which represented a cluster hierarchy using CFu-tree. In addition, Chen et al. [33] proposed an improved density clustering algorithm based on density-based spatial clustering of applications with noise (DBSCAN). Because DBSCAN was sensitive to the choice of parameters, the authors combined it with k-means to estimate the parameters.
Ensemble clustering is another clustering algorithm. Ensemble clustering combines the multiple results of different clustering algorithms to obtain final results. Multiview clustering is an extension of ensemble clustering and combines different data that have different properties and views [34,35].
Matrix factorization-based clustering is an important clustering approach [36]. Lu et al. [37] proposed a semisupervised concept factorization (SSCF), which contained nonnegative matrix factorization and concept factorization for text clustering. SSCF integrated penalty and reward terms by pairwise constraints, namely, must-link (ML) constraints and cannot-link (CL) constraints, which indicated that two documents belonged to the same cluster or to different clusters.
Topic-based text clustering is an effective text clustering approach, in which text documents are projected into a topic space. Latent Dirichlet allocation (LDA) is a common topic model. Yau et al. [38] separated scientific publications into several clusters based on LDA. Ma et al. [39] employed the topic model of LDA to represent the centroids of clusters and combined k-means++ algorithm for document clustering.
In some literature, additional information is introduced for text clustering such as side-information [40] and privileged information [41]. What is more, several global optimization algorithms are utilized for text clustering such as particle swarm optimization (PSO) algorithm [42,43] and bee colony optimization (BCO) algorithm [44,45].
Similarity measure is also an important issue in text clustering algorithms. To compute the similarity between a text document and a cluster is a fundamental problem in clustering algorithms. The most common similarity measure is distance metric such as Euclidean distance, Cosine distance, and Generalized Mahalanobis distance [46]. There exist other similarity measure methods such as IT-Sim (an information-theoretic measure) [47]. Besides similarity measure, measurement of discrimination information (MDI) is an opposite concept to compute the relations of text documents [48][49][50].

Theoretical Foundation
In this section, we describe some theories related to our work. This section contains three subsections, which are frequent pattern maximal (FPMAX), PageRank, and deep belief network (DBN).

FPMAX.
FPMAX is a depth-first and recursive algorithm for mining maximal frequent itemsets (MFIs) in a given dataset [51]. Before FPMAX is called, a frequent pattern tree (FP-tree) is built to store frequent itemsets, and each branch of the FP-tree represents a frequent itemset. The FP-tree includes a linked list head, which contains all items of the dataset. A maximal frequent itemset tree (MFI-tree) is introduced to store all MFIs in FPMAX. The procedure of FPMAX is described in Algorithm 1.

PageRank.
PageRank is a link-based ranking algorithm, which is used in the Google search engine. Most webpages on the Internet are connected with hyperlinks, which carry important information. Hence, webpages pointed to by many other webpages are considered to include quality information.
Webpages and hyperlinks in PageRank are structured as a directed graph $G = (V, E)$, where $V$ is the set of webpages and $E$ is the set of hyperlinks. Let $n$ be the total number of webpages. The PageRank score of the webpage $i$ is defined by
$$P(i) = \sum_{(j,i) \in E} \frac{P(j)}{O_j}, \tag{1}$$
where $O_j$ is the number of links of page $j$ pointing out to other webpages. Let $P$ be a vector to represent all PageRank scores:
$$P = (P(1), P(2), \ldots, P(n))^T. \tag{2}$$
Let $A$ be the adjacency matrix of the graph with
$$A_{ij} = \begin{cases} \dfrac{1}{O_i}, & \text{if } (i,j) \in E, \\ 0, & \text{otherwise.} \end{cases} \tag{3}$$
Hence, (1) can be written as the system of equations with
$$P = A^T P. \tag{4}$$
PageRank models web surfing as a stochastic process, and the theory of Markov chain can be applied. However, the web graph does not meet the conditions of stochastic process, which requires $A$ to be stochastic, irreducible, and aperiodic. After the adjustment of $A$ to fix this problem, we obtain an improved model with
$$P = \left( (1 - d) \frac{E}{n} + d A^T \right) P, \tag{5}$$
where $E$ is $e e^T$ ($e$ is a column vector of all 1's) and thus $E$ is an $n \times n$ matrix with all 1's (not to be confused with the set of hyperlinks), and $d$ is a parameter called damping factor. After scaling, we obtain
$$P = (1 - d) e + d A^T P. \tag{6}$$
Equation (6) is also transformed as follows:
$$P(i) = (1 - d) + d \sum_{j=1}^{n} A_{ji} P(j). \tag{7}$$
The computation of PageRank score is a process of iteration. Given an initial value of $P$, the iteration ends when the score of PageRank does not change or the change is less than a threshold.

Figure 1: The structure of DBN (visible layer, hidden layers 1 to k, and output layer).
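The iteration described above can be sketched in a few lines of Python. This is an illustrative implementation of the per-page update $P(i) = (1 - d) + d \sum_j A_{ji} P(j)$ under stated assumptions, not the authors' code: the function name `pagerank`, the dictionary-based graph format, the default damping factor 0.85, and the tolerance are all assumptions, and pages without out-links are simply skipped when distributing scores.

```python
def pagerank(out_links, d=0.85, tol=1e-10, max_iter=1000):
    """Iterate P(i) = (1 - d) + d * sum_j A_ji * P(j) until the scores settle.

    out_links maps each page to the list of pages it points to; every
    target must also appear as a key (assumed input format).
    """
    nodes = sorted(out_links)
    scores = {node: 1.0 for node in nodes}  # initial value of P
    for _ in range(max_iter):
        new_scores = {node: 1.0 - d for node in nodes}
        for j in nodes:
            targets = out_links[j]
            if not targets:
                continue  # assumption: dangling pages pass on no score
            share = d * scores[j] / len(targets)  # d * P(j) / O_j
            for i in targets:
                new_scores[i] += share
        change = max(abs(new_scores[node] - scores[node]) for node in nodes)
        scores = new_scores
        if change < tol:  # stop when the change is below the threshold
            break
    return scores


# Hypothetical three-page web graph.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = pagerank(graph)
```

With this scaling the scores sum to the number of pages when every page has at least one out-link, and in the toy graph page "c" ranks highest because both other pages point to it.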

Deep Belief Network (DBN).
DBN is a model of deep learning and is composed of multilayer restricted Boltzmann machines (RBMs). DBN contains the input layer (visible layer), the hidden layers, and the output layer. There are connections between a layer and its adjacent layer, but no connections among units within a layer. The structure of DBN is shown in Figure 1.
As shown in Figure 1, an RBM consists of two adjacent layers. The training of DBN includes two steps, pretraining and fine-tuning. An RBM contains a visible layer $v^{(l)}$ and a hidden layer $h^{(l)}$. The parameters of the RBM are $(W^{(l)}, a^{(l)}, b^{(l)})$: $W^{(l)}$ are the weights of connections between the visible layer and the hidden layer, and $(a^{(l)}, b^{(l)})$ are the bias vectors of the visible units and the hidden units. Given an initial value of $W^{(l)}$, the parameters are updated with
$$W^{(l)} \leftarrow W^{(l)} + \eta \nabla W^{(l)},$$
where $\eta$ is the learning rate, and $(a^{(l)}, b^{(l)})$ are updated similarly to $W^{(l)}$. The gradient $\nabla W^{(l)}$ is obtained by Gibbs Sampling:
$$\nabla W^{(l)}_{ij} = E\left[v^{(l)}_i h^{(l)}_j\right]_{\text{data}} - E\left[v^{(l)}_i h^{(l)}_j\right]_{\text{Gibbs}},$$
where $E[\cdot]_{\text{data}}$ and $E[\cdot]_{\text{Gibbs}}$ are the expectations over data samples and over samples from Gibbs Sampling, and $(\nabla a^{(l)}, \nabla b^{(l)})$ are obtained similarly to $\nabla W^{(l)}_{ij}$. After the pretraining of DBN, DBN is fine-tuned with a set of labeled inputs by error back propagation. The parameters are updated using the gradient $\nabla W^{(l)}_{ij} = h^{(l-1)}_i \delta^{(l)}_j$, where $\delta^{(l)}$ is an error vector.
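The pretraining step can be illustrated with one contrastive-divergence update (CD-1), which approximates the Gibbs expectation with a single sampling step. The sketch below is a simplified stand-in rather than the training procedure used in the paper: the binary units, the single Gibbs step, the helper names, and the toy sizes are all assumptions.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(v, W, b):
    """Activation probability of each hidden unit given visible vector v."""
    return [sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b))]


def visible_probs(h, W, a):
    """Activation probability of each visible unit given hidden vector h."""
    return [sigmoid(a[i] + sum(h[j] * W[i][j] for j in range(len(h))))
            for i in range(len(a))]


def sample(probs, rng):
    """Draw a binary state for each unit from its activation probability."""
    return [1.0 if rng.random() < p else 0.0 for p in probs]


def cd1_update(W, a, b, v0, eta=0.1, rng=None):
    """One CD-1 step: W += eta * (E[v h]_data - E[v h]_Gibbs), in place."""
    rng = rng or random.Random()
    ph0 = hidden_probs(v0, W, b)               # positive phase (data)
    h0 = sample(ph0, rng)
    v1 = sample(visible_probs(h0, W, a), rng)  # one Gibbs step
    ph1 = hidden_probs(v1, W, b)               # negative phase (Gibbs)
    for i in range(len(a)):
        for j in range(len(b)):
            W[i][j] += eta * (v0[i] * ph0[j] - v1[i] * ph1[j])
        a[i] += eta * (v0[i] - v1[i])
    for j in range(len(b)):
        b[j] += eta * (ph0[j] - ph1[j])
    return W, a, b


# Hypothetical RBM with 3 visible and 2 hidden units, all parameters zero.
W = [[0.0, 0.0] for _ in range(3)]
a = [0.0, 0.0, 0.0]
b = [0.0, 0.0]
cd1_update(W, a, b, v0=[1.0, 0.0, 1.0], rng=random.Random(0))
```

Stacking several such RBMs, each trained on the hidden activations of the previous one, gives the layer-wise pretraining of a DBN; fine-tuning by back propagation is not shown here.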

Deep-Learning Vocabulary Network
In this section, we propose an approach called deep-learning vocabulary network (DLVN) for text clustering. The first step of DLVN is the construction of vocabulary network. The cooccurrence of words or terms is useful information for text clustering. We use the nodes of the vocabulary network to represent words or terms and the edges of the vocabulary network to represent the relations between words or terms. In our work, there are two methods to obtain the cooccurrence relations of words: related-word set and TongYiCi CiLin.
Frequent itemsets are used to discover the relations of items in a database. We create related-word set by frequent itemsets, and each itemset of related-word set is a set of words with cooccurrence relation. PageRank is employed to obtain the "importance" of nodes (feature vectors) instead of the term frequency in VSM. Then, an improved DBN (called sparse-group DBN) is proposed for dimensionality reduction. In the process of clustering algorithm, we present DL-SP for clustering, in which coverage rate is used for similarity measure. The procedure of DLVN is shown in Figure 2.
Related-Word Set.
The relations of words or terms are important information in text documents. Usually, natural language has fixed collocations and corresponding contexts, which means some words or terms have a high probability of occurring together in a text document. Thus, the relations between words are important to represent the meaning of text documents. In our paper, we use frequent itemsets to obtain cooccurrence relations between words or terms.
Definition 1 (related-word set). Let $D = \{\text{word}_1, \text{word}_2, \ldots, \text{word}_n\}$ be the words of text documents from the same topic and $\text{sup}[\cdot]$ be the support of itemsets. Given a minimum support $\text{sup}_{ms}$, $R = \{\text{word}_i, \text{word}_j, \ldots, \text{word}_k\}$ is defined as an itemset of related-word set, where $\text{sup}[R] > \text{sup}_{ms}$.
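Definition 1 can be illustrated with a brute-force search for maximal frequent itemsets. The sketch below is not FPMAX (it enumerates candidate itemsets level by level instead of using an FP-tree) and is only practical for tiny inputs; the function names, the use of absolute document counts as support, and the toy data are assumptions.

```python
from itertools import combinations


def support(itemset, transactions):
    """Number of transactions (documents) that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)


def maximal_frequent_itemsets(transactions, min_support):
    """Return itemsets with support > min_support that have no frequent superset."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))
    frequent = []
    for size in range(1, len(items) + 1):
        level = [frozenset(c) for c in combinations(items, size)
                 if support(frozenset(c), transactions) > min_support]
        if not level:
            break  # no frequent itemset of this size, so none of a larger size
        frequent.extend(level)
    # Keep only itemsets that are not a proper subset of another frequent one.
    return [s for s in frequent if not any(s < other for other in frequent)]


# Hypothetical documents from one topic, each reduced to a set of words.
documents = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]
mfis = maximal_frequent_itemsets(documents, min_support=2)
```

In this toy example every pair of words occurs in three documents, while the full triple occurs in only two, so the three pairs are the maximal frequent itemsets.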
FPMAX is a depth-first and recursive algorithm for mining MFIs, and it is based on FP-tree to store frequent itemsets.When a database has a large scale, all itemsets of MFI-tree are detected in subset checking of FPMAX, which has a big influence on the efficiency of FPMAX.For improving the efficiency of FPMAX, we use TongYiCi CiLin and string match to compress the FP-tree.
TongYiCi CiLin is a Chinese semantic dictionary of synonyms and related words, which organizes all words as a five-layer hierarchical tree.It contains 77,343 words, which are divided into 12 major classes, 94 middle classes, and 1438 small classes.The fourth layer and the fifth layer are further divided into word groups and atomic word groups.We use Figure 3 to illustrate the structure of TongYiCi CiLin.
TongYiCi CiLin maps an atomic word group into a code: the first layer and the fourth layer are capital letters, the second layer is a lowercase letter, and the third layer and the fifth layer are integers. For example, code "Aa01A02" stands for the atomic word group {man, mankind, human}. In MFI mining, we replace the words or terms with the codes of the fourth-layer word groups, and this layer contains 4223 nodes. We randomly select 10 documents from the same topic, and the frequent items (words) are listed in Table 1. As some words belong to the same word group, the number of items is greatly reduced.
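The layered code format described above can be made concrete with a small parser. This is a sketch under assumptions: the two-digit width of the integer layers is inferred from the example "Aa01A02", the function names are hypothetical, and any extra markers used by particular editions of the dictionary are not handled.

```python
import re

# Layers: capital letter, lowercase letter, two digits, capital letter, two digits.
# Shorter codes (prefixes such as "Bo" or "Bo21A") are accepted as upper layers.
_CODE = re.compile(r"^([A-Z])(?:([a-z])(?:(\d{2})(?:([A-Z])(\d{2})?)?)?)?$")


def split_code(code):
    """Split a word-group code into its layers, from the first layer down."""
    match = _CODE.match(code)
    if match is None:
        raise ValueError(f"not a recognized word-group code: {code!r}")
    return tuple(part for part in match.groups() if part is not None)


def common_depth(code_a, code_b):
    """Depth (number of shared leading layers) of the first common father node."""
    depth = 0
    for x, y in zip(split_code(code_a), split_code(code_b)):
        if x != y:
            break
        depth += 1
    return depth


layers = split_code("Aa01A02")
shared = common_depth("Bo21A", "Bo25B")
```

Here `layers` holds the five layers of the atomic word group, and `shared` is 2 because the two fourth-layer codes agree only on their first two layers.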
The structures of FP-trees that are created based on words and word groups are shown in Figure 4. Figure 4(a) is FP-tree of words, and FP-tree of word groups is shown in Figure 4(b).The nodes of FP-tree based on the word groups are fewer than the nodes of FP-tree based on the words.
The MFIs have redundant items in Figure 4(b).For example, the MFIs of Figure 4(b) are listed in Table 2.

The Construction of Vocabulary Network.
In this section, vocabulary network is constructed to represent text documents, and the vocabulary network contains the relations between words or terms. We employ the "importance" of nodes instead of term frequency in VSM.

The Selection of Vocabulary Network Nodes.
The word groups in TongYiCi CiLin are used as nodes instead of words in vocabulary network. The number of word groups is much smaller than the number of words. In addition, we choose the word groups whose frequency is higher than a specified minimal frequency $f_{\min}$.

The Construction of Edges in the Vocabulary Network.
Edges of a complex network are important carriers of information, and the edges of the vocabulary network are used in calculating the "importance" of nodes. Considering the semantic and related information among words or terms, an edge is added to the vocabulary network in terms of the similarity of nodes. Therefore, we add an edge to the vocabulary network if word groups have a closer position in TongYiCi CiLin. The semantic similarity of two word groups, sim(a, b), is defined from the following quantities: depth(a, b), the depth of the first common father node; the depth of a and b themselves; TN, the total number of word groups; and Dis(a, b), the distance between a and b.

Procedure FPMAX-RS(T)
Input: T (an FP-tree), cov_min
Global: MFIT (an MFI-tree); Head (a linked list of items)
Output: the MFIT that contains all MFIs
Method:
(1) if T only contains a single path P
(2)     if cov(Head ∪ P, MFI) > cov_min
(3)         combine this path into the MFI-tree;
(4)     else
(5)         insert Head ∪ P into MFIT;
(6) else for each i in Header-table of T
(7)     append i to Head;
(8)     construct the Head-pattern base;
(9)     Tail = {frequent items in base};
(10)    subset_checking(Head ∪ Tail);
(11)    if Head ∪ Tail is not in MFI-tree
(12)        construct the FP-tree T_Head;
(13)        call FPMAX-RS(T_Head);
(14)    remove i from Head.
For example, consider two words whose word group codes are {Bo21A, Bo25B}. Because both nodes are in the fourth layer, the first common father node is {Bo}, which is in the second layer. In addition, the fourth layer contains 4223 word groups, and Dis(Bo21A, Bo25B) is 14. The value of sim(Bo21A, Bo25B) is then calculated from these quantities.
The nodes in the vocabulary network are traversed, and an edge between two nodes a and b is added when sim(a, b) > sim_min (the specified threshold).
In addition, we add an edge between two nodes if an MFI in related-word set includes the corresponding words, since each MFI in related-word set is a word set with cooccurrence relations. In fact, the words in an MFI are not necessarily similar in meaning; an MFI includes a group of words cooccurring in documents of the same topic. When a text document has the words in an MFI, the text document has a high probability of belonging to a certain topic. Therefore, we add an edge into the vocabulary network with the low-frequency word pointing to the high-frequency word.
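The rule for MFI-based edges (low-frequency word pointing to high-frequency word) can be sketched as follows. The function name `mfi_edges`, the input format, and the handling of ties are assumptions: the text does not say what happens when two words have the same frequency, so this sketch adds an edge in both directions in that case.

```python
from itertools import combinations


def mfi_edges(mfis, frequency):
    """Build directed edges (low-frequency node -> high-frequency node)."""
    edges = set()
    for mfi in mfis:
        for u, v in combinations(sorted(mfi), 2):
            if frequency[u] == frequency[v]:
                # Assumption: equal frequencies give edges in both directions.
                edges.add((u, v))
                edges.add((v, u))
            elif frequency[u] < frequency[v]:
                edges.add((u, v))
            else:
                edges.add((v, u))
    return edges


# Hypothetical MFI and word (or word-group) frequencies.
edges = mfi_edges([{"a", "b", "c"}], {"a": 5, "b": 3, "c": 3})
```

The resulting edge set can be combined with the similarity-based edges and passed to a PageRank routine, so that frequently pointed-to nodes receive a higher "importance".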

The Extraction of Feature Vectors.
In the vocabulary network, the number and the direction of edges reflect the importance of nodes, which is similar to evaluating the importance of webpages. Thus, PageRank is utilized to obtain the importance of nodes, and the initial PageRank value of each node is defined from the frequency of its word group. After iterative computation and normalization of the PageRank values, we use the PageRank scores of nodes as the feature vectors of text documents instead of term frequency in this paper.

Deep-Learning Single-Pass (DL-SP).
In this paper, sparse-group DBN is proposed for dimensionality reduction of feature vectors. DBN is a model of deep learning. Luo et al. [52] found that the units of hidden layers exhibited statistical dependencies and proposed a regularization constant to restrict the relations in hidden layers. Due to the sparsity of feature vectors, we combine the word dependencies and DBN to propose a sparse-group DBN for dimensionality reduction.
In addition, coverage rate (CoR) is proposed for similarity measure among feature vectors in DL-SP.

Sparse-Group DBN.
Deep learning simulates the process of human thinking, and the result of deep learning is the distributed representation of an input vector. By analyzing feature vectors extracted from the vocabulary network, we find that there exists statistical dependency between entries of feature vectors, which means that certain entries tend to be nonzero together in the same part of a feature vector. The word dependency is also mentioned by many researchers in the literature [5,18,53]. Cooccurrence relations are typically collected in feature vectors relative to a unique word, commonly referred to as the "target word", and the word dependency is quantified to measure word similarity in text clustering. We provide an example, which is a part of a feature vector, in Table 3.
Because the documents in the same topic usually include related words, a part of the units in the visible layer is active simultaneously, and accordingly the documents in different topics usually activate different parts of the units. Based on this observation, we add a regularization constant to the log-likelihood of training data to retain these relations. In experiments, we use documents of different topics to train the sparse-group DBN. The sequence of units in the output layer is adjusted accordingly, and the cooccurring units are divided into one group. In other words, the feature vectors of documents of different topics can activate different groups of units in the output layer. The structure of sparse-group DBN is shown in Figure 5.
Sparse-group DBN is composed of several RBMs, and each pair of adjacent layers forms an RBM. To retain the dependency of the units in the output layer, we define the activation probability of each group: given a group of units and a training sample, a group probability is defined. The output layer of the sparse-group DBN is divided into groups, and the probability of the output layer is defined from the group probabilities. We add a regularization constant and the output-layer probability of the training sample to the optimization function, which is the maximum likelihood estimate of the energy function of an RBM. Equation (11) is improved to (21) accordingly, and the gradients of the weights and of the bias vectors are redefined in the same way.
In the coverage rate, the similarity is computed between the feature vector of a cluster (named topic feature vector) and the feature vector of a new document. Moreover, the addition of many text documents to clusters has an influence on the topic feature vector. In our work, we introduce an optional topic feature vector and the weight of feature vector to solve this problem. We provide an example of optional topic feature vector in Figure 6.
When the weight of the optional topic feature vector is greater than a specified threshold in each time interval, we replace the topic feature vector with the optional topic feature vector as the new cluster center. The weight of a topic feature vector is defined from a time damping function and a frequency function.
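For orientation, the basic Single-Pass scheme that DL-SP builds on can be sketched as below. This sketch deliberately swaps in cosine similarity and a running-mean cluster center: the coverage rate, the optional topic feature vector, and its weight are not reproduced here. The function names and the threshold value are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def single_pass(vectors, threshold, similarity=cosine):
    """Assign each vector, in arrival order, to the most similar cluster.

    A new cluster is opened when no existing center reaches the threshold.
    Centers are running means of their members (stand-in for topic vectors).
    """
    centers, sizes, labels = [], [], []
    for vec in vectors:
        best, best_sim = None, threshold
        for idx, center in enumerate(centers):
            sim = similarity(center, vec)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            centers.append(list(vec))
            sizes.append(1)
            labels.append(len(centers) - 1)
        else:
            n = sizes[best]
            centers[best] = [(c * n + x) / (n + 1)
                             for c, x in zip(centers[best], vec)]
            sizes[best] = n + 1
            labels.append(best)
    return labels, centers


# Hypothetical two-dimensional feature vectors.
labels, centers = single_pass(
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]], threshold=0.8)
```

Because each document is seen only once, the result depends on the arrival order, and the cluster center drifts as documents are added, which is the effect the optional topic feature vector is meant to control.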

Experimental Analysis
In this section, we conduct three sets of experiments to validate the effectiveness of the proposed approach, including the efficiency of FPMAX-RS in related-word set mining, the comparison of feature vectors, and the comparison of DL-SP efficiency. In this work, three Chinese text corpora, TanCorpV1.0, Encyclopedia of China, and Sogou Corpus, are used as the experimental datasets.

The Comparison of Feature Vectors.
In this work, we compare the distance among the feature vectors based on tf-idf, FC-VSM [12], and DLVN. We randomly choose two documents from the category museum and one document in other categories including property, education, and military. The aim of feature extraction is to extract the feature vectors that can represent the meaning of text documents. In other words, feature vectors in different categories should have a longer distance than feature vectors in the same category. Therefore, we compute the Euclidean distance of feature vectors in different categories based on tf-idf, FC-VSM, and DLVN. Table 4 shows the results in different categories of text documents.
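The comparison above relies on the Euclidean distance between feature vectors. A minimal sketch is given below; the three vectors are hypothetical toy values chosen only to illustrate the expected pattern and are not taken from Table 4.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


# Hypothetical toy feature vectors, not values from the experiments.
museum_a = [0.9, 0.1, 0.0]
museum_b = [0.8, 0.2, 0.0]
military = [0.1, 0.0, 0.9]

within = euclidean(museum_a, museum_b)   # same category
across = euclidean(museum_a, military)   # different categories
```

A feature extraction method is doing its job when `within` is small relative to `across`, as it is for these toy vectors.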
In the following experiment, feature vectors are extracted based on tf-idf, FC-VSM, and DLVN. Then, k-means is applied for clustering, and we evaluate the clustering performance. Because seven categories of text documents are chosen in our experiment, the specified number of clusters $k$ is 7. Figure 8 illustrates that feature vectors based on DLVN have better performance.

The Comparison of DL-SP Efficiency.
In this experiment, we choose text documents from the datasets, and the number of each category is listed in Table 5.
The aim of the experiment is to compare DL-SP with LSI and Single-Pass. The sparse-group DBN has 3 layers. In this subsection, we also compare the running time of DL-SP and Single-Pass, and the result is listed in Table 6.

Conclusions
In this paper, we propose an approach DLVN for text clustering. The existing term frequency-based methods only count the occurrences of words, but the relations of words are not considered in feature extraction. The approach constructs vocabulary network to mine the importance of words using related-word set, which contains "cooccurrence" relations of words. Therefore, the text features of documents in the same category have shorter distance, and feature vectors have longer distance among different categories. Moreover, we employ sparse-group DBN to reduce the dimensionality of feature vectors in terms of the group relations of words. Thus, sparse-group DBN can retain the word dependency in dimensionality reduction. In the experiments, we compare the approach with well-known methods to verify our work, and the results show that feature vectors based on DLVN have better clustering performance.
In current work, we verify the approach using Chinese corpora. We will use English text to verify the effectiveness of the approach in future work. Moreover, in the process of dimension reduction, we need to train the sparse-group DBN using a large amount of text documents to improve its performance.

Table 2: MFIs of FP-tree based on word groups.

Figure 6: An example of optional topic feature vector.

Table 1: The comparison of words and word groups.

Let $MFIS = \{MFI_1, MFI_2, \ldots, MFI_n\}$ be the set of MFIs obtained from text documents and cov(·) be the number of items shared by two MFIs. Suppose that cov($MFI_1$, $MFI_2$) > cov_min, where cov_min is the minimum number of shared items. Then $MFI_1$ and $MFI_2$ are removed from MFIS, and the combination $MFI_1 \cup MFI_2$ is added to MFIS.
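The merging rule for overlapping MFIs can be sketched as follows. The definition describes the merge of a single pair; repeating the merge until no remaining pair shares more than cov_min items is an assumption of this sketch, as are the function name and the toy data.

```python
def merge_mfis(mfis, cov_min):
    """Merge MFIs that share more than cov_min items, until none remain."""
    merged = [set(m) for m in mfis]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                # cov(MFI_i, MFI_j): number of items the two MFIs share.
                if len(merged[i] & merged[j]) > cov_min:
                    merged[i] |= merged[j]   # add the union to the set
                    del merged[j]            # remove the absorbed MFI
                    changed = True
                    break
            if changed:
                break  # restart the scan after every merge
    return merged


# Hypothetical MFIs over word-group items.
result = merge_mfis([{"a", "b", "c"}, {"b", "c", "d"}, {"x", "y"}], cov_min=1)
```

In the toy input the first two MFIs share two items and are combined, while the third shares none and is kept as it is.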

Table 3: The word dependencies of a feature vector.

Table 5: The datasets of experiment.

The numbers of units in the layers of the sparse-group DBN are 4223, 3500, and 3000. In addition, the group number of the top layer is 200. The structure of sparse-group DBN is shown in Figure 9. The experimental result is shown in Figure 10. DL-SP has better performance than LSI and Single-Pass in sport, military, property, education, and health. However, the F_measure of DL-SP is lower than that of LSI and Single-Pass in category car, because the smaller number of documents in this category does not train the sparse-group DBN effectively.

Table 6: The running time of DL-SP and Single-Pass.