A Tensor CP Decomposition Method for Clustering Heterogeneous Information Networks via Stochastic Gradient Descent Algorithms

Clustering analysis is a basic and essential method for mining heterogeneous information networks, which consist of multiple types of objects and rich semantic relations among different object types. Heterogeneous information networks are ubiquitous in real-world applications, such as bibliographic networks and social media networks. Unfortunately, most existing approaches, such as spectral clustering, are designed to analyze homogeneous information networks, which are composed of only one type of objects and links. Some recent studies, such as RankClus and NetClus, have addressed heterogeneous information networks. However, they often assume that the network follows a simple schema, such as the bityped network schema or the star network schema. To overcome these limitations, we model the heterogeneous information network as a tensor without any restriction on the network schema. Then, a tensor CP decomposition method is adapted to formulate the clustering problem in heterogeneous information networks. Further, we develop two stochastic gradient descent algorithms, namely, SGDClus and SOSClus, which cluster multityped objects simultaneously and effectively. The experimental results on both synthetic datasets and a real-world dataset demonstrate that the proposed clustering framework can model heterogeneous information networks efficiently and outperform state-of-the-art clustering methods.


Introduction
Heterogeneous information networks are ubiquitous in real-world applications. Generally, heterogeneous information networks consist of multiple types of objects and rich semantic relations among different object types. The bibliographic network extracted from the DBLP database (http://www.informatik.uni-trier.de/∼ley/db/) is a typical heterogeneous information network, as shown in Figure 1. The DBLP database is an open resource that contains most bibliographic information on computer science. The bibliographic network contains four types of objects: author (A), paper (P), venue (i.e., conference or journal) (V), and term (T). The edges between author and paper are labeled "write" or "written by," the edges between paper and venue are labeled "publish" or "published by," and the edges between paper and term are labeled "contain" or "contained in."
Clustering analysis is a basic and essential method for mining such networks, which can help us better understand the semantic information and interpretable structure in the network. Unfortunately, most existing approaches, such as spectral clustering, are designed to analyze homogeneous information networks [1] that consist of only a single type of objects and links, while real-world networks are often heterogeneous information networks [2] with more than one type of objects and links. Clustering a heterogeneous information network is more difficult than clustering a homogeneous one, because we cannot directly measure the similarity among different types of objects and relations.
Though some recent studies have focused on clustering heterogeneous information networks, such as RankClus [1] and NetClus [2], they can only be applied to specific simple network schemas. RankClus can only model bityped networks, where only two different types of objects exist in the network. NetClus was developed for the star network schema, where links only appear between target objects and attribute objects. The network schema is a metatemplate of a heterogeneous information network, which shows how many types of objects and links there are in the network [3]. Figure 1 shows a typical star network schema, where the paper (P) is the target object and the others are attribute objects.
A tensor is a generalization of the matrix to high-dimensional space. It is a natural expression of complicated and interpretable structures in high-mode data. In this paper, we model a heterogeneous information network as a tensor without any restriction on the network schema. Each type of objects maps onto one mode of the tensor, and the semantic relations among different object types map onto the elements of the tensor. Then, a tensor CP decomposition method is adapted to formulate the clustering problem in heterogeneous information networks, and two stochastic gradient descent algorithms are developed to cluster multityped objects simultaneously. The experimental results on both synthetic datasets and a real-world dataset show that the proposed clustering framework can model heterogeneous information networks efficiently and outperform state-of-the-art clustering methods.
The rest of this paper is organized as follows. In Section 2, we discuss the related work on clustering for heterogeneous information networks and the tensor factorization. Section 3 gives some notations and definitions used in this paper. In Section 4, we formulate the clustering problem and describe two stochastic gradient descent algorithms. The experimental results on both synthetic datasets and real-world dataset are presented in Section 5. Finally, the conclusions are drawn in Section 6.

Related Work
Our work is mainly related to clustering heterogeneous information networks and to tensor factorization.

Clustering Heterogeneous Information Networks.
Clustering is an unsupervised learning method for recognizing the distribution and hidden structures in data, and it is a basic and significant task in pattern recognition and machine learning. Since MacQueen first proposed K-means [4] in 1967, many algorithms have been developed for clustering. However, most existing algorithms, such as hierarchical clustering [5], density-based clustering [6], mesh-based clustering [7], fuzzy clustering [8], and spectral clustering [9], are designed to analyze point sets or homogeneous information networks, which are composed of only one type of objects and links.
In real-world applications, datasets are often organized as heterogeneous information networks, where objects and the relations between them are of more than one type. In recent years, researchers have made significant progress on clustering heterogeneous information networks [10,11], largely along four main directions. The first direction is ranking-based clustering. RankClus [1] integrates clustering with ranking for clustering bityped networks. Its extension, NetClus [2], was developed for the star network schema. Both works showed that ranking and clustering can mutually enhance each other. GPNRankClus [12] assumes that edges in heterogeneous information networks follow a Poisson distribution and achieves clustering and ranking simultaneously. In addition, FctClus [13] achieves a higher computational speed and a greater clustering accuracy when applied to heterogeneous information networks. However, like NetClus, FctClus can only handle the star network schema. For a general network schema, HeProjI [14] projects the network into a number of bityped or star schema subnetworks and performs ranking-based clustering in each subnetwork.
The second direction involves metapath-based clustering algorithms [15,16]. A metapath is a connected path defined on the network schema of a heterogeneous information network, which represents a composite semantic relation between two objects. PathSim (metapath-based top-k similarity search) [3] measures the similarity between objects of the same type based on metapaths in heterogeneous information networks. However, it has a limitation: the metapath must be symmetric; that is, PathSim cannot work on different types of objects. The PathSelClus algorithm in [15][16][17] integrates metapath selection with user guidance to cluster objects in networks, where user-provided seeds for each cluster act as guidance.
The third direction is structural-based clustering. Sun et al. proposed a probabilistic clustering method [18] to deal with heterogeneous information networks with incomplete attributes, which integrated the incomplete attribute information and the network structure information. NetSim [19] is a structural-based similarity measurement between objects for x-star network. Xu proposed a Bayesian probabilistic model based on network structural information for clustering heterogeneous information networks.
The final direction is a clustering algorithm based on social network features. Zhou and Liu designed the SI-Cluster algorithm [20], which adopted the heat diffusion procedure to model the social influence and then measure the similarity between objects.

Tensor Factorization.
A tensor is a multidimensional array, in which the elements are addressed by more than two indices. Tensor factorization has been studied since the early 20th century [21][22][23][24][25]. Two of the most popular tensor factorization methods are Tucker decomposition [21,24] and canonical decomposition using parallel factors (CANDECOMP/PARAFAC) [23,24]. CANDECOMP/PARAFAC is also named CP decomposition. It is worth noting that CP decomposition is a special case of Tucker decomposition.
Clustering problems based on tensor factorization are often modeled as optimization problems [26]. However, it has been proven in [27] that tensor clustering formulations are NP-hard. In the past years, many approximation algorithms for tensor clustering have been proposed [28][29][30]. These theories provide a new perspective on clustering heterogeneous information networks. Tensor factorization based clustering has also been used in some specific applications. Examples include link prediction in higher-order network structures [31,32], collaborative filtering in recommendation systems [33], community detection in multigraphs [34], graph clustering [35], and modeling multisource datasets [36].
The Alternating Least Squares (ALS) algorithm [22,37,38] is one of the best known and most commonly used algorithms for tensor factorization. In each round it updates one component while holding the others constant. However, ALS suffers from some limitations; for example, ALS may converge to a local minimum, and the memory consumption may explode when the tensor is large. Nonlinear optimization is another way to obtain the tensor factorization, for example, the nonlinear conjugate gradient method [39], Newton-based optimization [40], the randomized block sampling method [41], and stochastic gradient descent [42]. In this paper, we adopt stochastic gradient descent with a Tikhonov-regularized loss function to compute the tensor CP decomposition used for clustering.

Preliminaries
First, we introduce some related concepts and notations of tensors used in this paper. More details about tensor algebra can be found in [24,43]. The order of a tensor is the number of dimensions, also known as ways or modes. We will follow the convention used in [23] to denote scalars by lowercase letters, for example, $a$, $b$, $c$, vectors (one mode) by boldface lowercase letters, for example, $\mathbf{a}$, $\mathbf{b}$, $\mathbf{c}$, matrices (two modes) by boldface capital letters, for example, $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$, and tensors (three modes or more) by boldface calligraphic letters, for example, $\mathcal{X}$, $\mathcal{Y}$, $\mathcal{Z}$. Elements of a matrix or a tensor are denoted by lowercase letters with subscripts; that is, the $(i_1, i_2, \ldots, i_N)$th element of an $N$th-order tensor $\mathcal{X}$ is denoted by $x_{i_1, i_2, \ldots, i_N}$. The notations about tensor algebra used in this paper are summarized in Notations.
If an $N$th-order tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ can be written as the outer product of $N$ vectors, that is, $\mathcal{X} = \mathbf{a}^{(1)} \circ \mathbf{a}^{(2)} \circ \cdots \circ \mathbf{a}^{(N)}$ with $\mathbf{a}^{(n)} \in \mathbb{R}^{I_n}$, $n = 1, 2, \ldots, N$, the tensor $\mathcal{X}$ is called a rank-one tensor. The CP decomposition represents a tensor as a sum of rank-one tensors; that is, the CP decomposition of $\mathcal{X}$ is $\mathcal{X} = \sum_{k=1}^{K} \mathbf{a}_k^{(1)} \circ \mathbf{a}_k^{(2)} \circ \cdots \circ \mathbf{a}_k^{(N)}$, where $K$ is a positive integer and $\mathbf{a}_k^{(n)} \in \mathbb{R}^{I_n}$; $n = 1, 2, \ldots, N$; $k = 1, 2, \ldots, K$. The rank of a tensor is defined as the smallest number of rank-one tensors for which the equality holds in the CP decomposition and is denoted as $\mathrm{rank}(\mathcal{X}) = \min K$. In fact, the problem of tensor rank determination is NP-hard [27].
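As a small illustration of the CP decomposition as a sum of rank-one tensors, the following sketch rebuilds a tensor from its factor matrices. It is a minimal example under our own assumptions: the function name `cp_reconstruct`, the nested-list layout `factors[n][i][k]`, and the toy values are hypothetical and are not taken from the paper.

```python
from itertools import product

def cp_reconstruct(factors):
    """Sum of K rank-one tensors.

    factors[n][i][k] is entry (i, k) of the n-th factor matrix (assumed layout).
    Returns a dict mapping each index tuple to the reconstructed entry.
    """
    shape = [len(U) for U in factors]
    K = len(factors[0][0])
    tensor = {}
    for idx in product(*(range(s) for s in shape)):
        total = 0.0
        for k in range(K):
            term = 1.0
            for n, i in enumerate(idx):
                term *= factors[n][i][k]  # entry of the k-th outer product
            total += term
        tensor[idx] = total
    return tensor

# Toy 2 x 2 x 2 tensor built from two rank-one components.
A = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0, 0.0], [0.0, 2.0]]
C = [[1.0, 0.0], [0.0, 3.0]]
X = cp_reconstruct([A, B, C])
```

Here the first component contributes only to entry (0, 0, 0) and the second only to entry (1, 1, 1), so every other entry of the reconstructed tensor is zero.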
Definition 1 (information network [3]). An information network is a graph $G = (V, E)$ defined on a set of objects $V$ and a set of links $E$, where $V$ belongs to $N$ object types $\mathcal{V} = \{V_n\}_{n=1}^{N}$ and $E$ belongs to $M$ link types $\mathcal{E} = \{R_m\}_{m=1}^{M}$.
Specifically, when $N > 1$ or $M > 1$, the information network is called a heterogeneous information network; otherwise, it is called a homogeneous information network.
We denote the object set of type $V_n$ as $\{v^{(n)}_{i}\}_{i=1}^{I_n}$, where $I_n$ is the number of objects in type $V_n$; that is, $I_n = |V_n|$ and $n = 1, 2, \ldots, N$. The total number of objects in the network is given by $I = \sum_{n=1}^{N} I_n$.
Definition 2 (network schema [3]). The network schema is a metatemplate for a heterogeneous information network $G = (V, E)$, which is a graph defined over the object types $\mathcal{V}$ and the link types $\mathcal{E}$, denoted by $T_G = \{\mathcal{V}, \mathcal{E}\}$. A network schema $T_G = \{\mathcal{V}, \mathcal{E}\}$ shows how many types of objects there are in the network $G = (V, E)$ and which type the links between different object types belong to. Figure 1 shows the network schema of DBLP, which follows a star network schema.

Tensor CP Decomposition Based Clustering
A gene-network in the DBLP network (Figure 1, which contains four types of objects, {A, P, V, T}), denoted by $G'$, represents the semantic relation "an author $v^{A}_{i_A}$ writes a paper $v^{P}_{i_P}$ published in the venue $v^{V}_{i_V}$ and containing the term $v^{T}_{i_T}$." For simplicity, we use the subscripts of the objects in $G'$ to mark the corresponding gene-network. In this example, the gene-network is marked by $G'_{i_A, i_P, i_V, i_T}$.
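Let $\mathcal{X}$ be an $N$th-order tensor of size $I_1 \times I_2 \times \cdots \times I_N$; each mode of $\mathcal{X}$ represents one type of objects in the network $G$. An arbitrary element $x_{i_1, i_2, \ldots, i_N}$ is an indicator of whether the corresponding gene-network $G'_{i_1, i_2, \ldots, i_N}$ exists; that is,
$$x_{i_1, i_2, \ldots, i_N} = \begin{cases} 1, & \text{if } G'_{i_1, i_2, \ldots, i_N} \text{ exists}, \\ 0, & \text{otherwise}. \end{cases} \quad (1)$$
Then, the heterogeneous information network $G = (V, E)$ can be represented in tensor form as $\mathcal{X}$.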
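The tensor representation can be sketched in a few lines by storing only the coordinates of the existing gene-networks. This is a minimal sketch under our own assumptions: the helper names, the (author, paper, venue, term) ordering of the modes, and the toy index tuples are hypothetical and are used only for illustration.

```python
def build_indicator_tensor(gene_networks):
    """Store a 0/1 tensor sparsely as the set of coordinates whose entry is 1."""
    return set(tuple(g) for g in gene_networks)

def entry(nonzeros, index):
    """Indicator entry: 1 if the gene-network with these subscripts exists, else 0."""
    return 1 if tuple(index) in nonzeros else 0

# Hypothetical toy network: (author, paper, venue, term) index tuples.
gene_networks = [(0, 0, 0, 1), (1, 0, 0, 1), (2, 1, 1, 0)]
nonzeros = build_indicator_tensor(gene_networks)

# Mode sizes and the resulting density (fraction of nonzero entries).
sizes = (3, 2, 2, 2)
cells = 1
for s in sizes:
    cells *= s
density = len(nonzeros) / cells
```

Keeping only the nonzero coordinates mirrors the coordinate (sparse) format that the experiments rely on, since most entries of the tensor are zero.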

Problem Formulation.
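Using the tensor representation $\mathcal{X}$ of $G = (V, E)$, we can partition the multityped objects into different clusters by the CP decomposition. We assume that there are $K$ clusters in $G = (V, E)$ and denote $\mathbf{U}^{(n)} \in \mathbb{R}^{I_n \times K}$; $n = 1, 2, \ldots, N$ as the cluster indication matrix of the $n$th type of objects. Then, the CP decomposition of $\mathcal{X}$ is
$$\min_{\mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \ldots, \mathbf{U}^{(N)}} \left\| \mathcal{X} - [\![\mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \ldots, \mathbf{U}^{(N)}]\!] \right\|_F^2, \quad (3)$$
where $[\![\mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \ldots, \mathbf{U}^{(N)}]\!] = \sum_{k=1}^{K} \mathbf{u}_k^{(1)} \circ \mathbf{u}_k^{(2)} \circ \cdots \circ \mathbf{u}_k^{(N)}$ and $\mathbf{u}_k^{(n)}$ is the $k$th column of $\mathbf{U}^{(n)}$. Each row $\mathbf{u}^{(n)}_{i_n} = [u^{(n)}_{i_n,1}, u^{(n)}_{i_n,2}, \ldots, u^{(n)}_{i_n,K}]^{\top}$ of the factor matrix $\mathbf{U}^{(n)}$ is the probability vector of the $i_n$th object of the $n$th type, whose $k$th entry is the probability that the object belongs to the $k$th cluster. In other words, the $k$th cluster is composed of the $k$th rank-one tensor in the CP decomposition; that is, $\mathbf{u}_k^{(1)} \circ \mathbf{u}_k^{(2)} \circ \cdots \circ \mathbf{u}_k^{(N)}$.
Figure 2 gives an example of the tensor CP decomposition method for clustering a heterogeneous information network. The left part is the original network with three types of objects, the middle cube is a 3-mode tensor, and the right part is the CP decomposition of the 3-mode tensor, which is also the partition of the original network. The three types of objects are the yellow circles, the blue squares, and the red triangles, respectively. The number within each object is the identifier of the object. Each element (black dot in the middle cube) in the tensor represents a gene-network in the network (black dashed circle on the left). Each component (black dashed circle on the right) in the CP decomposition shows one cluster of the original network.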
The problem in (3) is an NP-hard nonconvex optimization problem, which has a continuous manifold of equivalent solutions [39]. In other words, the global minimum is hidden among many local minima, which makes it difficult to find. In real-world scenarios, the objects in heterogeneous information networks may belong to more than one cluster; that is, the clusters overlap. However, the number of clusters to which the vast majority of objects belong is much smaller than the total number of clusters. That is, most of the elements in $\mathbf{U}^{(n)}$ should be zero, so $\mathbf{U}^{(n)}$ should be sparse. To overcome these two challenges, we introduce a Tikhonov regularization term, proposed by Paatero in [44,45], into the objective function and replace the objective function by the following loss function:
$$L = \frac{1}{2} \left\| \mathcal{X} - [\![\mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \ldots, \mathbf{U}^{(N)}]\!] \right\|_F^2 + \frac{\lambda}{2} \sum_{n=1}^{N} \left\| \mathbf{U}^{(n)} \right\|_F^2,$$
where $\lambda > 0$ is a regularization parameter. Let $f$ be the first, squared loss component of $L$ and let $g$ be the Tikhonov regularization term of $L$; then $L = f + g$.
The Tikhonov regularization term in the loss function $L$ has an encouraging property: it makes the Frobenius norms of all factor matrices in the optimization equal; that is,
$$\left\| \mathbf{U}^{(1)} \right\|_F = \left\| \mathbf{U}^{(2)} \right\|_F = \cdots = \left\| \mathbf{U}^{(N)} \right\|_F.$$
Therefore, the local minima of the loss function $L$ become isolated, and any replacement or rescaling of a satisfactory solution is excluded from the optimization. The details of the proof can be found in [39]. Meanwhile, the Tikhonov regularization term can ensure the sparsity of the factor matrices by penalizing the number of nonzero elements. Therefore, the tensor CP decomposition method for clustering heterogeneous information networks can be formalized as
$$\min_{\mathbf{U}^{(1)}, \ldots, \mathbf{U}^{(N)}} L \quad \text{s.t.} \quad \sum_{k=1}^{K} u^{(n)}_{i_n,k} = 1, \quad u^{(n)}_{i_n,k} \in [0, 1], \quad \sum_{i_n=1}^{I_n} u^{(n)}_{i_n,k} > 0, \quad (9)$$
where $n = 1, 2, \ldots, N$; $i_n = 1, 2, \ldots, I_n$; $k = 1, 2, \ldots, K$, and $K < \min\{I_1, I_2, \ldots, I_N\}$ is the total number of clusters. In (9), we divide $\mathcal{X}$ into $K$ clusters and obtain the structure of each cluster, which includes the distribution of each object.
The first constraint in (9) guarantees that the sum of probabilities for each object belonging to all clusters is 1. The second constraint in (9) represents that each probability should be in the range of [0, 1]. The last constraint in (9) makes sure that there is no empty cluster for any mode.

The Stochastic Gradient Descent Algorithms.
Stochastic gradient descent is a mature and widely used tool for optimizing various models in machine learning, such as artificial neural networks, support vector machines, and logistic regression. In this section, the regularized clustering problem in (9) will be addressed by the stochastic gradient descent algorithms. The details of tensor algebra and properties used in this section can be found in [43].
First, we review the stochastic gradient descent algorithm. To solve an optimization problem $\min_{x} f(x)$, where $f(x)$ is a differentiable objective function to be minimized and $x$ is a variable, the stochastic gradient descent method updates $x$ as
$$x \leftarrow x - \eta \frac{\partial f(x)}{\partial x},$$
where $\eta$ is a positive number, named the learning rate or step size. The convergence speed of the stochastic gradient descent algorithm depends on the choice of the learning rate $\eta$ and the initial solution.
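To make the update rule concrete, here is a minimal sketch of stochastic gradient descent on a toy one-parameter least squares problem. The problem, the function name `sgd_least_squares`, the fixed learning rate, and the seed are our own illustrative assumptions and are unrelated to the clustering objective of the paper.

```python
import random

def sgd_least_squares(xs, ys, eta=0.05, steps=200, seed=0):
    """Fit y = w * x by sampling one data point per step."""
    rng = random.Random(seed)
    w = 0.0  # initial solution
    for _ in range(steps):
        i = rng.randrange(len(xs))
        # Gradient of the sampled loss 0.5 * (w * x_i - y_i) ** 2 with respect to w.
        grad = (w * xs[i] - ys[i]) * xs[i]
        w -= eta * grad  # x <- x - eta * gradient
    return w

# Noise-free toy data generated with w = 2.
w = sgd_least_squares([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```

With this step size every sampled update shrinks the error, so `w` approaches 2. A larger `eta` can make the iterates oscillate, which is the sensitivity to the learning rate mentioned above.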
Though the stochastic gradient descent algorithm may converge to a local minimum at a linear speed, its efficiency near the optimal point is poor [46]. To speed up the final optimization phase, an extension named the second-order stochastic algorithm is designed in [46], which rescales the gradient by the inverse of the second-order derivative of the objective function; that is,
$$x \leftarrow x - \eta \left( \frac{\partial^2 f(x)}{\partial x^2} \right)^{-1} \frac{\partial f(x)}{\partial x}.$$
Now, we apply the stochastic gradient descent and the second-order stochastic algorithm to the clustering problem in (9) and propose two algorithms, named SGDClus (Stochastic Gradient Descent for Clustering) and SOSClus (Second-Order Stochastic for Clustering), respectively.
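Therefore, the partial derivative of $L$ with respect to $\mathbf{U}^{(n)}$ is given by
$$\frac{\partial L}{\partial \mathbf{U}^{(n)}} = -\mathbf{X}_{(n)} \left( \odot^{(\backslash n)} \mathbf{U} \right) + \mathbf{U}^{(n)} \mathbf{\Gamma}^{(n)} + \lambda \mathbf{U}^{(n)}, \quad (17)$$
where $\mathbf{X}_{(n)}$ is the matricization of $\mathcal{X}$ along the $n$th mode, $\odot^{(\backslash n)} \mathbf{U}$ is the Khatri-Rao product of all factor matrices except $\mathbf{U}^{(n)}$, and $\mathbf{\Gamma}^{(n)}$ is the Hadamard product of the matrices $\mathbf{U}^{(m)\top} \mathbf{U}^{(m)}$ for all $m \neq n$. The gradient descent update in (12), $\mathbf{U}^{(n)} \leftarrow \mathbf{U}^{(n)} - \eta \, \partial L / \partial \mathbf{U}^{(n)}$, can then be rewritten as
$$\mathbf{U}^{(n)} \leftarrow \mathbf{U}^{(n)} \left[ (1 - \eta \lambda) \mathbf{I} - \eta \mathbf{\Gamma}^{(n)} \right] + \eta \mathbf{X}_{(n)} \left( \odot^{(\backslash n)} \mathbf{U} \right), \quad (18)$$
where $\mathbf{I}$ is an identity matrix. Note that the $\{\mathbf{U}^{(n)}\}_{n=1}^{N}$ derived by (18) do not satisfy the first and second constraints in (9). To satisfy these two constraints, we can normalize each row of $\mathbf{U}^{(n)}$ so that it sums to 1:
$$u^{(n)}_{i_n,k} \leftarrow \frac{u^{(n)}_{i_n,k}}{\sum_{k'=1}^{K} u^{(n)}_{i_n,k'}}. \quad (19)$$
Furthermore, the pseudocode of SGDClus is given in Algorithm 1.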
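The gradient step followed by row normalization can be sketched as follows for a single factor matrix. This is only an illustrative sketch, not the authors' Algorithm 1: it assumes the standard regularized CP gradient (the product of the matricized tensor with the Khatri-Rao product of the other factors, minus the current factor times the Hadamard product of Gram matrices plus the regularization term), a 0/1 tensor stored as a set of nonzero coordinates, and nested-list factor matrices. All function names are hypothetical, and clipping negative entries before normalizing is our own assumption.

```python
def mttkrp(nonzeros, factors, n, K):
    """Matricized tensor times Khatri-Rao product, visiting only nonzeros (all equal to 1)."""
    out = [[0.0] * K for _ in range(len(factors[n]))]
    for idx in nonzeros:
        for k in range(K):
            term = 1.0
            for m, i in enumerate(idx):
                if m != n:
                    term *= factors[m][i][k]
            out[idx[n]][k] += term
    return out

def gram_hadamard(factors, n, K):
    """Hadamard product of the Gram matrices of all factor matrices except the n-th."""
    G = [[1.0] * K for _ in range(K)]
    for m, U in enumerate(factors):
        if m != n:
            for a in range(K):
                for b in range(K):
                    G[a][b] *= sum(row[a] * row[b] for row in U)
    return G

def sgd_step(nonzeros, factors, n, K, eta, lam):
    """One gradient step U <- U - eta * dL/dU for the n-th factor matrix."""
    M = mttkrp(nonzeros, factors, n, K)
    G = gram_hadamard(factors, n, K)
    new_U = []
    for i, row in enumerate(factors[n]):
        grad = [-M[i][k] + sum(row[a] * G[a][k] for a in range(K)) + lam * row[k]
                for k in range(K)]
        new_U.append([row[k] - eta * grad[k] for k in range(K)])
    return new_U

def normalize_rows(U):
    """Clip negatives (our assumption), then scale each row to sum to 1."""
    out = []
    for row in U:
        clipped = [max(v, 0.0) for v in row]
        s = sum(clipped)
        out.append([v / s for v in clipped] if s > 0 else [1.0 / len(row)] * len(row))
    return out

# Toy 2-mode example: two gene-networks, two clusters.
nonzeros = {(0, 0), (1, 1)}
U1 = [[0.9, 0.1], [0.1, 0.9]]
U2 = [[0.9, 0.1], [0.1, 0.9]]
U1_new = normalize_rows(sgd_step(nonzeros, [U1, U2], 0, 2, eta=0.1, lam=0.001))
```

In this toy case the step moves each row of the first factor matrix closer to a hard cluster indicator, and the normalization keeps every row a valid probability vector.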
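Note that $\mathbf{X}_{(n)} \left( \odot^{(\backslash n)} \mathbf{U} \right) \left( \mathbf{\Gamma}^{(n)} + \lambda \mathbf{I} \right)^{-1}$ is the General Gradient-based Optimization (OPT) [39] solution for updating $\mathbf{U}^{(n)}$ in the regularized CP decomposition, obtained by setting (17) equal to zero. See the details of the proof in [39]. Let
$$\mathbf{U}^{(n)}_{\mathrm{opt}} = \mathbf{X}_{(n)} \left( \odot^{(\backslash n)} \mathbf{U} \right) \left( \mathbf{\Gamma}^{(n)} + \lambda \mathbf{I} \right)^{-1}. \quad (23)$$
Therefore, the updating rule of $\mathbf{U}^{(n)}$ in SOSClus is, intuitively, a weighted average of the current solution and the OPT solution; that is,
$$\mathbf{U}^{(n)} \leftarrow (1 - \eta) \mathbf{U}^{(n)} + \eta \mathbf{U}^{(n)}_{\mathrm{opt}}. \quad (24)$$
Actually, SOSClus is a general extension of General Gradient-based Optimization (OPT) [39] and of the ALS with step size restriction in the randomized block sampling method [41]. In (24), when the learning rate $\eta = 1$, we get the OPT solution. When the regularization parameter $\lambda = 0$, SOSClus becomes the ALS with step size restriction in the randomized block sampling method.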
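The weighted-average form of the SOSClus update can be checked with a short derivation. The sketch below assumes the loss $L$ with the factor $\tfrac{1}{2}$ on the squared error and $\tfrac{\lambda}{2}$ on the regularizer, and it assumes that the second-order derivative with respect to $\mathbf{U}^{(n)}$ acts as right multiplication by $\mathbf{\Gamma}^{(n)} + \lambda \mathbf{I}$; the shorthand $\mathbf{Z}^{(n)}$ for the Khatri-Rao product of all factor matrices except the $n$th is ours.

```latex
\begin{align*}
\frac{\partial L}{\partial \mathbf{U}^{(n)}}
  &= -\mathbf{X}_{(n)} \mathbf{Z}^{(n)}
     + \mathbf{U}^{(n)} \left( \mathbf{\Gamma}^{(n)} + \lambda \mathbf{I} \right), \\
\mathbf{U}^{(n)} - \eta \, \frac{\partial L}{\partial \mathbf{U}^{(n)}}
     \left( \mathbf{\Gamma}^{(n)} + \lambda \mathbf{I} \right)^{-1}
  &= \mathbf{U}^{(n)} - \eta \, \mathbf{U}^{(n)}
     + \eta \, \mathbf{X}_{(n)} \mathbf{Z}^{(n)}
       \left( \mathbf{\Gamma}^{(n)} + \lambda \mathbf{I} \right)^{-1} \\
  &= (1 - \eta) \, \mathbf{U}^{(n)} + \eta \, \mathbf{U}^{(n)}_{\mathrm{opt}} .
\end{align*}
```

Setting $\eta = 1$ discards the current solution and returns the OPT solution, and setting $\lambda = 0$ removes the regularizer from the inverse, which matches the two special cases described above.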
Similar to SGDClus, the $\{\mathbf{U}^{(n)}\}_{n=1}^{N}$ derived by (24) in SOSClus also do not satisfy the first and second constraints in (9). We should normalize each row of $\{\mathbf{U}^{(n)}\}_{n=1}^{N}$ according to (19). The pseudocode of SOSClus is given in Algorithm 2.

Feasibility Analysis
Proof. Since the proofs for different types of objects in the heterogeneous information network $G = (V, E)$ are similar, we will simply describe the process for a single type of objects. Without loss of generality, we detail the proof on the $n$th type of objects. Given a heterogeneous information network $G = (V, E)$ and its tensor representation $\mathcal{X}$, the nonzero elements in $\mathcal{X}$ represent the input gene-networks in $G = (V, E)$, which we want to partition into $K$ clusters $\{C_1, C_2, \ldots, C_K\}$. The centre of the cluster $C_k$ is denoted by $\mathbf{c}_k$. By using the coordinate format [47] as the sparse representation of $\mathcal{X}$, the gene-networks can be denoted as a matrix $\mathbf{M} \in \mathbb{R}^{\mathrm{nnz}(\mathcal{X}) \times N}$, where $\mathrm{nnz}(\mathcal{X})$ is the number of nonzero elements in $\mathcal{X}$. Each row $\mathbf{m}_j \in \mathbf{M}$, $j = 1, 2, \ldots, \mathrm{nnz}(\mathcal{X})$, gives the subscripts of the corresponding nonzero element in $\mathcal{X}$. In other words, $\mathbf{m}_j$ represents a gene-network, and the entries $m_{j,n} \in \mathbf{m}_j$, $n = 1, 2, \ldots, N$, are the subscripts of the objects contained in the gene-network.
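The traditional clustering approach, such as K-means, minimizes the sum of differences between the individual gene-networks in each cluster and the corresponding cluster centres; that is,
$$\min \sum_{k=1}^{K} \sum_{j=1}^{\mathrm{nnz}(\mathcal{X})} p_{j,k} \left\| \mathbf{m}_j - \mathbf{c}_k \right\|^2,$$
where $p_{j,k}$ is the probability of gene-network $\mathbf{m}_j$ belonging to the $k$th cluster. We can also rewrite the problem from the perspective of clustering the individual objects in the gene-networks, in which case the weight $p_{i_n,k}$ is the probability of object $v^{(n)}_{i_n}$ belonging to the $k$th cluster.
In matrix form, K-means can be formalized as
$$\min_{\mathbf{P}, \mathbf{C}} \left\| \mathbf{M} - \mathbf{P} \mathbf{C} \right\|_F^2,$$
where $\mathbf{P}$ is the cluster indication matrix and $\mathbf{C}$ contains the cluster centres. By matricization of $\mathcal{X}$ along the $n$th mode, the CP decomposition of $\mathcal{X}$ in (3) can be rewritten as
$$\min_{\mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \ldots, \mathbf{U}^{(N)}} \left\| \mathbf{X}_{(n)} - \mathbf{U}^{(n)} \left( \odot^{(\backslash n)} \mathbf{U} \right)^{\top} \right\|_F^2,$$
where $\mathbf{X}_{(n)}$ is the matricization of $\mathcal{X}$ along the $n$th mode and $\mathbf{M}$ is the sparse representation of $\mathcal{X}$. Let $\mathbf{U}^{(n)} = \mathbf{P}$ and let $\left( \odot^{(\backslash n)} \mathbf{U} \right)^{\top} = \mathbf{C}$. That is, $\mathbf{U}^{(n)}$ is the cluster indication matrix for the $n$th type of objects and $\left( \odot^{(\backslash n)} \mathbf{U} \right)^{\top}$ contains the cluster centres. So, the CP decomposition in (3) is equivalent to K-means clustering for the $n$th type of objects in the heterogeneous information network $G = (V, E)$.
By matricization of $\mathcal{X}$ in (3) along different modes, we can prove that the CP decomposition is equivalent to K-means clustering for the other types of objects. So, the CP decomposition of $\mathcal{X}$ obtains the clustering of multityped objects in the heterogeneous information network $G = (V, E)$ simultaneously.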
It is worth noting that the CP decomposition based clustering is a soft clustering method; the factor matrices indicate the probability of objects belonging to corresponding clusters. The soft clustering is more in line with reality, because many objects in the heterogeneous information networks may belong to several clusters. In other words, the clusters are overlapping. In some cases, the overlapping clusters need to be translated into nonoverlapping clusters, which can be achieved by using different approaches, such as K-means, to cluster the rows of factor matrices. Usually, the nonoverlapping clusters can be obtained by simply assigning each object to the cluster which has the largest entry in the corresponding row of factor matrix.
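The last step, turning soft memberships into nonoverlapping clusters, can be sketched by picking the largest entry in each row of a factor matrix. The function name `hard_assign` and the toy membership values are hypothetical.

```python
def hard_assign(U):
    """Assign each object (row) to the cluster with the largest membership entry."""
    return [max(range(len(row)), key=lambda k: row[k]) for row in U]

# Toy factor matrix: 3 objects, 3 clusters, rows are membership probabilities.
U = [[0.7, 0.2, 0.1],
     [0.1, 0.1, 0.8],
     [0.4, 0.5, 0.1]]
labels = hard_assign(U)
```

When a row contains ties, this sketch keeps the first maximal cluster; any other tie-breaking rule would serve equally well for the purpose described above.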

Time Complexity Analysis.
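The main time consumption of updating each factor matrix $\mathbf{U}^{(n)}$ in SGDClus is computing $\partial L / \partial \mathbf{U}^{(n)}$. According to (17), we need to calculate $\mathbf{X}_{(n)} \left( \odot^{(\backslash n)} \mathbf{U} \right)$ and $\mathbf{U}^{(n)} \mathbf{\Gamma}^{(n)}$.
Firstly, if we successively calculate the Khatri-Rao product of $N - 1$ matrices and a matrix-matrix multiplication when computing $\mathbf{X}_{(n)} \left( \odot^{(\backslash n)} \mathbf{U} \right)$, the intermediate results will be very large and the computational cost will be very high. In practice, we can reduce the complexity by skipping unnecessary calculations. Consider an element of $\mathbf{X}_{(n)} \left( \odot^{(\backslash n)} \mathbf{U} \right)$; that is,
$$\left( \mathbf{X}_{(n)} \left( \odot^{(\backslash n)} \mathbf{U} \right) \right)_{i_n,k} = \sum_{j} x_{i_n,j} \prod_{m=1, m \neq n}^{N} u^{(m)}_{i_m,k},$$
where $x_{i_n,j}$ is an element in the matricization of $\mathcal{X}$, which represents a corresponding gene-network in the heterogeneous information network, and the column index $j$ runs over the combinations of the remaining subscripts $i_m$, $m \neq n$. When $x_{i_n,j} = 0$, we can skip the corresponding Khatri-Rao product. Hence, only the nonzero elements in $\mathcal{X}$ need to be computed. Therefore, the time complexity for computing $\mathbf{X}_{(n)} \left( \odot^{(\backslash n)} \mathbf{U} \right)$ is $O(\mathrm{nnz}(\mathcal{X}) N K)$.
Secondly, the element of $\mathbf{U}^{(n)} \mathbf{\Gamma}^{(n)}$ is given by
$$\left( \mathbf{U}^{(n)} \mathbf{\Gamma}^{(n)} \right)_{i_n,k} = \sum_{k'=1}^{K} u^{(n)}_{i_n,k'} \prod_{m=1, m \neq n}^{N} \sum_{i_m=1}^{I_m} u^{(m)}_{i_m,k'} u^{(m)}_{i_m,k}.$$
So, the time complexity of computing $\mathbf{U}^{(n)} \mathbf{\Gamma}^{(n)}$ is $O((N - 1) K^2 I)$, where $I$ is the total number of objects in the network.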
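A rough operation count shows why skipping the zero entries matters. The sketch below compares the multiplications needed when only nonzero entries are visited with the number of entries in an explicitly formed Khatri-Rao product. The tensor size and nonzero count are those reported for the DBLP-four-area dataset in the experiments; the choice of $K = 4$ clusters and the two helper functions are our own assumptions, and the counts ignore constant factors.

```python
from math import prod

def sparse_mults(nnz, N, K):
    """Multiplications when each nonzero touches K columns and N - 1 other factors."""
    return nnz * (N - 1) * K

def khatri_rao_entries(sizes, n, K):
    """Entries of the explicit Khatri-Rao product of all factor matrices except the n-th."""
    return prod(s for m, s in enumerate(sizes) if m != n) * K

sizes = (14376, 14475, 20, 8920)   # tensor size reported for DBLP-four-area
sparse_cost = sparse_mults(334832, len(sizes), 4)
dense_entries = khatri_rao_entries(sizes, 0, 4)
```

Under these assumptions the sparse traversal needs about four million multiplications, while the explicit Khatri-Rao product for the first mode alone would hold more than ten billion entries.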
In summary, the time complexity of each iteration in SGDClus is $O(\mathrm{nnz}(\mathcal{X}) N K + (N - 1) K^2 I)$. Note that, in real-world heterogeneous information networks, the number of clusters $K$ and the number of object types $N$ are usually far less than $I$; that is, $K \ll I$ and $N \ll I$.
According to (22), the time complexity of updating each factor matrix U ( )

Experiments and Results
In this section, we present several experiments on synthetic and real-world datasets for heterogeneous information networks and compare the performance with a number of state-of-the-art clustering methods.

Evaluation Metrics.
In the experiments, we adopt the Normalized Mutual Information (NMI) [48] and Accuracy (AC) as our performance measurements.
NMI is used to measure the mutual dependence between the clustering result and the ground truth. Given $m$ objects, $K$ clusters, one clustering result, and the ground truth classes for the objects, let $n(i, j)$, $i, j = 1, 2, \ldots, K$, be the number of objects that are labeled $i$ in the clustering result but labeled $j$ in the ground truth. The joint distribution can be defined as $p(i, j) = n(i, j)/m$, the marginal distribution of rows can be calculated as $p_1(j) = \sum_{i=1}^{K} p(i, j)$, and the marginal distribution of columns can be calculated as $p_2(i) = \sum_{j=1}^{K} p(i, j)$. Then, the NMI is defined as
$$\mathrm{NMI} = \frac{\sum_{i=1}^{K} \sum_{j=1}^{K} p(i, j) \log \dfrac{p(i, j)}{p_1(j) \, p_2(i)}}{\sqrt{\sum_{j=1}^{K} p_1(j) \log p_1(j) \sum_{i=1}^{K} p_2(i) \log p_2(i)}}.$$
The NMI ranges from 0 to 1: the larger the NMI, the better the clustering result.
AC is used to compute the clustering accuracy, which measures the percentage of correctly clustered objects. AC is defined as
$$\mathrm{AC} = \frac{\sum_{i=1}^{m} \delta(\mathrm{map}(i), \mathrm{label}(i))}{m},$$
where $\mathrm{map}(i)$ is the cluster label of the object $i$ and $\mathrm{label}(i)$ is the ground truth class of the object $i$. And $\delta(\cdot)$ is an indicator function:
$$\delta(x, y) = \begin{cases} 1, & \text{if } x = y, \\ 0, & \text{otherwise}. \end{cases}$$
Since both NMI and AC measure the performance of clustering one type of objects, the weighted average NMI and AC are also used to measure the performance of the proposed methods and other state-of-the-art methods.
In the dataset descriptions that follow, $N$ is the number of object types in the heterogeneous information network and also the number of modes in the tensor. $K$ is the number of clusters. $S$ is the network scale, and $S = I_1 \times I_2 \times \cdots \times I_N$. $D$ is the density of the tensor, that is, the percentage of nonzero elements in the tensor, and $D = \mathrm{nnz}(\mathcal{X})/S$.
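Both metrics can be sketched directly from the definitions above. The code below is a minimal illustration under our own assumptions: labels are integers in the range 0 to K - 1, and the mapping from cluster labels to ground truth classes used in AC is found by brute force over all permutations, which is practical only for small K (an assignment solver such as the Hungarian algorithm would be the usual choice for larger K). The function names are hypothetical.

```python
from itertools import permutations
from math import log, sqrt

def nmi(pred, truth, K):
    """Normalized Mutual Information with the geometric-mean normalization."""
    m = len(pred)
    p = [[0.0] * K for _ in range(K)]
    for a, b in zip(pred, truth):
        p[a][b] += 1.0 / m                                    # joint distribution
    p1 = [sum(p[i][j] for i in range(K)) for j in range(K)]  # ground truth marginal
    p2 = [sum(p[i][j] for j in range(K)) for i in range(K)]  # clustering marginal
    mutual = sum(p[i][j] * log(p[i][j] / (p1[j] * p2[i]))
                 for i in range(K) for j in range(K) if p[i][j] > 0)
    h1 = sum(q * log(q) for q in p1 if q > 0)
    h2 = sum(q * log(q) for q in p2 if q > 0)
    denom = sqrt(h1 * h2)
    return mutual / denom if denom > 0 else 0.0

def accuracy(pred, truth, K):
    """Fraction of correctly clustered objects under the best label mapping."""
    best = 0
    for perm in permutations(range(K)):
        hits = sum(1 for a, b in zip(pred, truth) if perm[a] == b)
        best = max(best, hits)
    return best / len(pred)

# A clustering that is perfect up to a relabeling of the clusters.
perfect_nmi = nmi([1, 1, 0, 0, 2, 2], [0, 0, 1, 1, 2, 2], 3)
perfect_ac = accuracy([1, 1, 0, 0, 2, 2], [0, 0, 1, 1, 2, 2], 3)
# One of six objects is placed in the wrong cluster.
partial_ac = accuracy([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1], 2)
```

Both metrics are invariant to how the clusters are numbered, which is why the first example scores perfectly even though its labels are permuted.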

Experimental Setting.
In order to compare the performance of our proposed SGDClus and SOSClus with the other methods impartially, all methods share a common stopping condition: the algorithm stops when the change between $L_{\mathrm{iter}}$ and $L_{\mathrm{iter}-1}$ falls below a preset tolerance or when iter reaches the maximum number of iterations. $L_{\mathrm{iter}}$ and $L_{\mathrm{iter}-1}$ are the values of the function $L$ at the current, that is, (iter)th, iteration and the previous, that is, (iter − 1)th, iteration, respectively. We set the maximum number of iterations to 1000. Throughout the experiments, the regularization parameter in SGDClus and SOSClus is fixed as $\lambda = 0.001$. All experiments are implemented in MATLAB R2015a (version 8.5.0), 64-bit. The MATLAB Tensor Toolbox (version 2.6, http://www.sandia.gov/∼tgkolda/TensorToolbox/) is used in our experiments. Since heterogeneous information networks are often sparse in real-world scenarios, that is, most elements in the tensor $\mathcal{X}$ are zeros, we use the sparse format of $\mathcal{X}$ proposed in [47], which is supported by the MATLAB Tensor Toolbox. The experimental results are the average values obtained by running the algorithms ten times on the corresponding datasets.

The Synthetic Datasets
Description. The purpose of using synthetic datasets is to examine whether the proposed tensor CP decomposition clustering framework can work well, since the detailed cluster structures of the synthetic datasets are known. In order to make the synthetic datasets similar to a realistic situation, we assume that the distribution for different types of objects that appear in a gene-network follows Zipf's law (see details online: https://en.wikipedia.org/wiki/Zipf's_law). Zipf's law is defined by $f(r; s, I) = r^{-s} / \sum_{i=1}^{I} i^{-s}$, where $I$ is the number of objects, $r$ is the object index, and $s$ is the parameter characterizing the distribution. Zipf's law gives the frequency of the $r$th object appearing in the gene-network. We set $s = 0.95$ and generate 4 synthetic datasets with different parameters. The details of these synthetic datasets are shown in Table 1.
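The Zipf-distributed choice of objects can be sketched as follows. This illustrates only the sampling of one object per mode; the way cluster structure is planted in the synthetic datasets is not described in the text, so the helper names, the mode sizes, and the seed below are hypothetical.

```python
import random

def zipf_pmf(I, s):
    """Probability of each of I objects under Zipf's law with parameter s."""
    weights = [r ** (-s) for r in range(1, I + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def sample_gene_network(sizes, s, rng):
    """Draw one object index per mode, each following Zipf's law."""
    return tuple(rng.choices(range(I), weights=zipf_pmf(I, s))[0] for I in sizes)

rng = random.Random(0)
pmf = zipf_pmf(5, 0.95)
gene_network = sample_gene_network((5, 4, 3), 0.95, rng)
```

With the parameter set to 0.95, low-index objects are drawn far more often than high-index ones, which mimics the skewed participation of objects in real networks.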

Experimental Results.
At the beginning of the experiments, we set a common learning rate, $\eta = 1/(\mathrm{iter} + 1)$, for SGDClus and SOSClus. We find that SOSClus has a faster convergence speed and better robustness with respect to the learning rate, which is clearly shown in Figure 3. Although SGDClus may eventually converge to a local minimum, the optimization is inefficient near a local minimum. As shown in Figure 3, the solutions of SGDClus swing around a local minimum. This phenomenon indicates that the convergence speed of SGDClus is sensitive to the choice of the learning rate $\eta$. Then, we modify the learning rate to $\eta = 1/(\mathrm{iter} + c)$ for SGDClus, where $c$ is a constant optimized in the experiments. In practice, $c = 27855$ for SGDClus running on Syn3 and $c = 430245$ for SGDClus running on Syn4. The performance comparison of SGDClus with learning rate $\eta = 1/(\mathrm{iter} + c)$ and SOSClus with learning rate $\eta = 1/(\mathrm{iter} + 1)$ on Syn3 and Syn4 is shown in Figure 4. By employing the optimized learning rate, SGDClus converges to a local minimum quickly. However, compared to SOSClus, SGDClus still has no advantage. The hand-drawn blue circles over the curve of SOSClus in Figure 4 show that SOSClus can escape from a local minimum and find the global minimum, while SGDClus stays at the first local minimum it reaches.
According to (23) and (24), we obtain the solutions of OPT as a by-product of running SOSClus. So, we compare the AC and NMI of OPT, SGDClus, and SOSClus on the 4 synthetic datasets, which are shown in Figure 5. With the increase of object types in the heterogeneous information networks, the AC and NMI of SOSClus and OPT increase distinctly, while the performance of SGDClus hardly changes. When $N = 2$, the AC and NMI of these three methods on Syn1 and Syn2 are almost equal and low. However, the AC and NMI of SOSClus increase to 1 when $N = 4$. Since the histograms of OPT, SGDClus, and SOSClus on Syn1 and Syn2 are almost the same and those on Syn3 and Syn4 are also similar, we conclude that the number of clusters $K$ and the network scale $S$ have no significant effect on the performance. Generally, a larger density and a larger number of object types in the network result in higher AC and NMI of SOSClus.
Obviously, in the experiments on the 4 synthetic datasets, SOSClus shows an excellent performance. SOSClus has a faster convergence speed and better robustness with respect to the learning rate. Meanwhile, SOSClus performs better on AC and NMI, because it can escape from a local minimum and find the global minimum.

Real-World Dataset Description.
The experiments on the real-world dataset are used to compare the performance of the tensor CP decomposition clustering framework with other state-of-the-art methods.
The real-world dataset extracted from the DBLP database is the DBLP-four-area dataset, which can be downloaded from http://web.cs.ucla.edu/∼yzsun/data/DBLP_four_area.zip. It is a four research-area subset of DBLP and is used in [2,3,12,13,15,16,18]. The four research areas in the DBLP-four-area dataset are database (DB), data mining (DM), machine learning (ML), and information retrieval (IR). There are five representative conferences in each area. All related authors, papers published in these conferences, and terms contained in these papers' titles are included. The DBLP-four-area dataset contains 14,376 papers with 100 labeled, 14,475 authors with 4,057 labeled, 20 labeled conferences, and 8,920 terms. We construct a 4-mode tensor of size 14,376 × 14,475 × 20 × 8,920 with 334,832 nonzero elements, so the density of the DBLP-four-area dataset is $9.01935 \times 10^{-9}$. We compare the performance of the tensor CP decomposition clustering framework with several other methods on the labeled records in this dataset.
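The reported density follows directly from the tensor size and the number of nonzero elements, as this short check illustrates. The variable names are ours.

```python
# Tensor size and nonzero count as reported for the DBLP-four-area dataset.
sizes = (14376, 14475, 20, 8920)
nonzero_count = 334832

total_entries = 1
for s in sizes:
    total_entries *= s          # network scale: product of the mode sizes

density = nonzero_count / total_entries
```

The tensor has roughly 3.7 × 10^13 entries in total, so a dense representation is out of reach and a sparse coordinate format is the practical choice.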

Comparative Methods
(i) NetClus (see [2]). An extended version of RankClus [1] that can deal with networks following the star network schema. The time complexity of NetClus for clustering each object type in each iteration is $O(K|E| + (K^2 + K)N)$, where $K$ is the number of clusters, $|E|$ is the number of edges in the network, and $N$ is the total number of objects in the network.
(ii) PathSelClus (see [15,16]). A clustering method based on predefined metapaths that requires user guidance. In PathSelClus, the distance between objects of the same type is measured by PathSim [3], and the method starts from seeds given by the user. The time complexity of PathSelClus for clustering each object type in each iteration is (( + 1)|P| + ), where |P| is the number of metapath instances in the network. The time complexity of PathSim used by PathSelClus for clustering each object type is ( ), where is the average degree of objects.
(iii) FctClus (see [13]). It is a recently proposed clustering method for heterogeneous information networks. As with NetClus, the FctClus method can deal with networks following the star network schema. The time complexity of FctClus for clustering each object type in each iteration is ( | | + ).

Experimental Results.
As the baseline methods can only deal with heterogeneous information networks of specific schemas, we must construct different subnetworks for them. For NetClus and FctClus, the network is organized as a star network schema as in [2,13], where paper (P) is the center type, and author (A), conference (C), and term (T) are the attribute types. For PathSelClus, we select the metapaths P-T-P, A-P-C-P-A, and C-P-T-P-C to cluster the papers, authors, and conferences, respectively, and we give each cluster one seed to start. For our framework, we model the DBLP-four-area dataset as a 4-mode tensor, where each mode represents one object type. The 4 modes are author (A), paper (P), conference (C), and term (T), respectively; the order of the object types is immaterial. Each element of the tensor represents a gene-network in the heterogeneous information network. In the experiments, we set the learning rate for SOSClus to 1/(iter + 1) and use an optimized learning rate of 1/(iter + 1000125) for SGDClus. Running SOSClus also yields the solutions of OPT as a by-product. We therefore compare the experimental results of OPT, SGDClus, and SOSClus on the DBLP-four-area dataset with those of the three baseline methods. See the details in Tables 2, 3, and 4.
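The tensor model and the two learning-rate schedules described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: the toy records, the coordinate-dictionary storage, the element value of 1.0, and the zero-based iteration counter are all assumptions, and the function names are hypothetical.

```python
# Hypothetical toy records: one (author, paper, conference, term) index
# tuple per gene-network, following the mode order A, P, C, T.
records = [
    (0, 0, 0, 0),
    (0, 0, 0, 1),
    (1, 1, 0, 1),
    (2, 2, 1, 2),
    (2, 2, 1, 3),
]


def build_sparse_tensor(index_tuples):
    """Store only nonzero elements, keyed by their index tuple."""
    # Assumption: every observed gene-network has value 1.0.
    return {tuple(idx): 1.0 for idx in index_tuples}


def tensor_shape(tensor):
    """Infer the size of each mode from the largest index seen in it."""
    order = len(next(iter(tensor)))
    return tuple(max(idx[m] for idx in tensor) + 1 for m in range(order))


def sosclus_learning_rate(iteration):
    """Schedule 1/(iter + 1), as stated for SOSClus."""
    return 1.0 / (iteration + 1)


def sgdclus_learning_rate(iteration, offset=1_000_125):
    """Schedule 1/(iter + offset), with the offset reported for SGDClus."""
    return 1.0 / (iteration + offset)


if __name__ == "__main__":
    tensor = build_sparse_tensor(records)
    print("shape:", tensor_shape(tensor), "nonzeros:", len(tensor))
    for it in (0, 10, 1000):
        print(it, sosclus_learning_rate(it), sgdclus_learning_rate(it))
```

With the large offset, the SGDClus step size starts near 10^-6 and decays very slowly, whereas the SOSClus step size starts at 1 and decays quickly.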
In Tables 2 and 3, SOSClus achieves the best average AC and NMI, and SGDClus takes second place. All methods achieve satisfactory AC and NMI on the conferences, since there are only 20 conferences in the network. SGDClus has the shortest running time, and OPT and SOSClus have a clear running-time advantage over the baselines. The time complexity of SGDClus is ( (X) + ( − 1) 2 ), and the time complexity of SOSClus is ( (X) +( −1) 2 + 3 ), where nnz(X) is the number of nonzero elements in X, that is, the number of gene-networks in the heterogeneous information network. We have nnz(X) < |P| ≪ | |, ≪ , and ≪ . Compared with the time complexity of the three baselines, SGDClus and SOSClus are at a slight disadvantage. However, it is worth noting that the three baselines can cluster only one type of objects in the network per run, while OPT, SGDClus, and SOSClus obtain the clusters of all types of objects simultaneously in a single run. This is why only the total time is shown for OPT, SGDClus, and SOSClus in Table 4. Moreover, the total time of OPT, SGDClus, and SOSClus is of the same order of magnitude as the running time of the baselines for clustering each type of objects, which is consistent with the comparison of the time complexity.

Conclusion
In this paper, a tensor CP decomposition method for clustering heterogeneous information networks is presented. In the tensor CP decomposition clustering framework, each type of objects in the heterogeneous information network is modeled as one mode of a tensor, and the gene-networks in the network are modeled as the elements of the tensor. In other words, the framework can model different types of objects and semantic relations in the heterogeneous information network without any restriction on the network schema. In addition, two stochastic gradient descent algorithms, named SGDClus and SOSClus, are designed. SGDClus and SOSClus cluster all types of objects and the gene-networks simultaneously in a single run. The proposed algorithms outperformed other state-of-the-art clustering methods in terms of AC, NMI, and running time.

Conflicts of Interest
The authors declare that they have no conflicts of interest.