Community discovery reveals the community structure of a network and supports personalized services and information pushing for consumers. It plays an important role in promoting the intelligence of the network society. Most networks have a community structure in which vertices gather into groups, and this structure is significant for network data mining and identification. Existing community detection methods explore the original network topology, but they do not make full use of the inherent semantic information on nodes, e.g., node attributes. To solve this problem, we explore networks by considering both the original network topology and the inherent community structures. In this paper, we propose a novel nonnegative matrix factorization (NMF) model that is divided into two parts, the community structure matrix and the node attribute matrix, and we present a matrix updating method to solve the resulting NMF optimization problem. NMF can reduce large-scale multidimensional data, which helps discover the internal relationships between networks and measure the degree of network association. The community structure matrix that we propose provides more information about the network structure by considering the relationships between nodes that are directly connected or that share similar neighboring nodes. The use of node attributes provides a semantic interpretation for the community structure. We conduct experiments on attributed graph datasets with overlapping and nonoverlapping communities. The results show that the F1-Score and Jaccard-Similarity on overlapping communities and the normalized mutual information (NMI) and accuracy (AC) on nonoverlapping communities are significantly improved. Our proposed model achieves significant improvements in accuracy and relevance compared with the state-of-the-art approaches.

The science of networks is a modern discipline spanning the natural, social, and computer sciences, as well as engineering. There are different kinds of networks in the real world, such as citation networks, social networks, and collaboration networks [

Most existing community detection algorithms analyze the network by using the original network topology information [

However, some networks contain not only network topology information but also node attribute information, which gives a semantic interpretation of the community structure. For example, papers in citation networks contain titles, abstracts, and keywords that may be represented using binary-valued vectors. We binarize the categorical inputs so that they can be treated as vectors in Euclidean space (we call this embedding the vector in Euclidean space).
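The binarization step can be illustrated with a minimal sketch. The function name `binarize_attributes`, the toy paper identifiers, and the keyword sets below are hypothetical and only show the idea of mapping categorical attributes to binary-valued vectors over a shared vocabulary.

```python
def binarize_attributes(node_keywords):
    """Map each node's set of categorical values to a 0/1 vector.

    node_keywords: dict of node id -> set of keywords (toy input, assumed).
    Returns the sorted vocabulary and a dict of node id -> binary vector.
    """
    vocabulary = sorted({kw for kws in node_keywords.values() for kw in kws})
    index = {kw: i for i, kw in enumerate(vocabulary)}
    vectors = {}
    for node, kws in node_keywords.items():
        row = [0] * len(vocabulary)
        for kw in kws:
            row[index[kw]] = 1  # 1 means the node has this attribute
        vectors[node] = row
    return vocabulary, vectors


# Hypothetical citation-network example: three papers with keywords.
papers = {
    "p1": {"community", "nmf"},
    "p2": {"nmf", "clustering"},
    "p3": {"graph"},
}
vocabulary, vectors = binarize_attributes(papers)
```

Each row of the resulting matrix is one node, and each column is one attribute value, which is the layout usually assumed for a node attribute matrix.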

Such networks with node attributes are named attributed graphs [

Nonnegative matrix factorization (NMF) is an effective method in community detection. Some scholars have studied it. Luo et al. [

The state-of-the-art methods (e.g., CDE [

Scholars have proposed some important methods for large-scale community detection such as neighborhood, maximal subgraph, intimate degree, and core-vertices [

We propose a novel method that generates the community structure matrix, which retains the relationship between two nodes that are directly connected or share the same neighbors

We combine node attribute information and community structure information in an effective way. Then, we propose our method, named Community Detection with Community Structure and Node Attributes (CDCN), to identify the network communities with semantic annotation and community structures using nonnegative matrix factorization framework [

Extensive experiments were conducted on public datasets to demonstrate the effectiveness of CDCN, and its accuracy and performance were better than those of the state-of-the-art methods

The remainder of the paper is organized as follows. Section

This section briefly summarizes three types of community detection models, which differ in the information they use to analyze the network. The three types are listed in Table

A brief summary of community detection methods.

Types | Representative works |
---|---|
Original network topology | GN [ |
Node attributes | CAN [ |
Both original network topology and node attributes | COMODO [ |

The first type of community detection method focuses on the original network topology. GN [

However, the above community detection methods directly utilize the original network topology and ignore the inherent semantic information on nodes (e.g., node attributes). Missing and meaningless information in the network topology often leads to poor results. Therefore, the second type of method focuses on node attributes, and it includes some classical or state-of-the-art clustering methods. Strictly speaking, those methods are not community detection methods, but they can use node attribute information to discover communities. Thus, in this paper, we also regard them as related work. CAN [

The third type of community detection method considers both the original network topology and node attribute information. Several algorithms that consider both structural and attribute information have been proposed in [

Nonnegative matrix factorization (NMF) [

Unlike all those methods, we combine node attribute information and community structure information by generating the community structure matrix, which retains the relationship between two directly connected nodes or nodes that share the same neighbors. This method can find the relationships between networks more accurately. The topology diagram of the three methods can be seen in Figure

Figure: The community structure matrix of different methods. Panels: using the adjacency matrix; embedding matrix of CDE; community structure matrix of our model.

Figure

In this section, we propose a novel algorithm for community detection that combines the community structure and node attribute information. We will introduce the community structure, node attributes, the overall model, and the algorithm in detail.

The community structure part models the network structure. Given an undirected network

As one of the state-of-the-art methods, SCI directly regards the binary-valued adjacency matrix

We start from a simple and reasonable observation about whether two nodes belong to the same community: such nodes tend to be surrounded by a similar environment, that is, they tend to share similar neighboring nodes.
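The shared-neighbor observation can be illustrated with a small sketch. The function names, the toy graph, and the optional inverse-log-degree weighting (the Adamic-Adar index, which down-weights shared neighbors that are connected to many nodes) are assumptions for illustration; they are not taken from the paper's own similarity definition.

```python
import math


def neighbor_sets(edges):
    """Build node -> set of neighbors for an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def shared_neighbor_similarity(adj, u, v, penalize_hubs=False):
    """Similarity of two distinct nodes based on their common neighbors."""
    if u == v:
        raise ValueError("u and v must be distinct nodes")
    common = adj[u] & adj[v]
    if not penalize_hubs:
        return float(len(common))  # plain common-neighbor count
    # A common neighbor touches both u and v, so its degree is >= 2
    # and log(degree) is strictly positive.
    return sum(1.0 / math.log(len(adj[w])) for w in common)


# Toy graph (assumed): "hub" links to five nodes, "cold" only to a and b.
edges = [("hub", "a"), ("hub", "b"), ("hub", "x"), ("hub", "y"),
         ("hub", "z"), ("cold", "a"), ("cold", "b")]
adj = neighbor_sets(edges)
plain = shared_neighbor_similarity(adj, "a", "b")
weighted = shared_neighbor_similarity(adj, "a", "b", penalize_hubs=True)
```

In the toy graph, nodes `a` and `b` share two neighbors. With the weighting enabled, the low-degree neighbor `cold` contributes more to the score than the high-degree neighbor `hub`.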

Therefore, we measure the similarity of community memberships as follows:

According to the similarity of the node community memberships, we can obtain a community structure matrix

In daily life, if two people like the same movie, can we conclude that they share similar hobbies? That may be right, but the evidence is insufficient if the movie is popular and almost everyone likes it. If the movie is obscure, however, we can be more confident that the two people share similar hobbies. The same reasoning applies to community detection. If two nodes share a cold neighboring node (one to which few nodes are connected), that neighbor makes a large contribution to the similarity between the two nodes. In other words, we should add a penalty to hot neighboring nodes (those to which many nodes are connected). By replacing

Figure

We define

We define

The node attribute matrix

In this way, we can use the node attribute information to divide the communities.

In this subsection, we elaborate on the overall model of our CDCN method. The method has two parts: the community structure part and the node attribute part.
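To make the two-part structure concrete, the following is a minimal sketch of a generic joint NMF with a shared membership factor. It is an assumed formulation, not the exact CDCN objective: it minimizes ||S - UH||^2 + lam * ||X - UW||^2 with standard multiplicative updates, where S stands for a community structure matrix, X for a node attribute matrix, and U for the shared node-to-community factor. The names `joint_nmf` and `lam` and the toy matrices are hypothetical.

```python
import random


def matmul(A, B):
    """Dense matrix product for matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def combine(A, B, weight):
    """Entrywise A + weight * B."""
    return [[a + weight * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def squared_error(A, B):
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def objective(S, X, U, H, W, lam):
    return squared_error(S, matmul(U, H)) + lam * squared_error(X, matmul(U, W))


def multiplicative_step(M, numer, denom, eps=1e-9):
    """Entrywise M * numer / denom; eps guards against division by zero."""
    return [[m * p / (q + eps) for m, p, q in zip(rm, rp, rq)]
            for rm, rp, rq in zip(M, numer, denom)]


def joint_nmf(S, X, k, lam=1.0, iters=200, seed=0):
    """Factor S ~ U H and X ~ U W with one shared nonnegative factor U."""
    rng = random.Random(seed)

    def init(rows, cols):
        return [[rng.random() + 0.1 for _ in range(cols)] for _ in range(rows)]

    U, H, W = init(len(S), k), init(k, len(S[0])), init(k, len(X[0]))
    history = [objective(S, X, U, H, W, lam)]
    for _ in range(iters):
        Ut = transpose(U)
        UtU = matmul(Ut, U)
        H = multiplicative_step(H, matmul(Ut, S), matmul(UtU, H))
        W = multiplicative_step(W, matmul(Ut, X), matmul(UtU, W))
        numer = combine(matmul(S, transpose(H)), matmul(X, transpose(W)), lam)
        gram = combine(matmul(H, transpose(H)), matmul(W, transpose(W)), lam)
        U = multiplicative_step(U, numer, matmul(U, gram))
        history.append(objective(S, X, U, H, W, lam))
    return U, H, W, history


# Toy data (assumed): two groups of three nodes with distinct attributes.
S = [[1, 1, 1, 0, 0, 0]] * 3 + [[0, 0, 0, 1, 1, 1]] * 3
X = [[1, 1, 0, 0]] * 3 + [[0, 0, 1, 1]] * 3
U, H, W, history = joint_nmf(S, X, k=2)
# Nonoverlapping assignment: each node takes its largest membership entry.
labels = [max(range(2), key=lambda c: row[c]) for row in U]
```

Sharing U between both reconstruction terms is what ties the structural and attribute information together in this sketch. The weight `lam` controls how strongly the attributes influence the memberships.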

We combine the community structure part in equation (

To make the model more generalized, we do not strictly restrict the symmetric decomposition of

In practice, it is common to set

In summary, we use the same variable

As for the detection of overlapping communities, we identify that node

In this subsection, we present the solution algorithm for our proposed model. The learning procedure of CDCN is given in Algorithm

Begin

According to network graph

End While.

End

The matrix

Based on this, the updating rules for the variables are given as follows:

The algorithm for the optimization (

In this section, we conduct extensive comparative experiments to evaluate the effectiveness of our proposed CDCN model on real graph datasets with ground-truth communities.

We consider 7 widely accepted network ground-truth community datasets, i.e., Karate, Polbooks, Football (

Dataset statistics.

Datasets | Nodes | Edges | Attributes | Communities |
---|---|---|---|---|
Karate | 34 | 156 | — | 2 |
Polbooks | 105 | 882 | — | 3 |
Football | 115 | 1226 | — | 12 |
Cornell | 195 | 283 | 1703 | 5 |
Texas | 187 | 280 | 1703 | 5 |
Washington | 230 | 366 | 1703 | 5 |
Wisconsin | 265 | 459 | 1703 | 5 |
Citeseer | 3312 | 4536 | 3703 | 6 |
Facebook | 4039 | 88243 | 10 | 193 |
HEP-TH | 20048 | 236230 | 300 | 537 |

The node attributes of Cornell, Texas, Washington, Wisconsin, Citeseer, and Facebook are binary vectors whose elements are either 0 or 1. The node attributes of HEP-TH are dense vectors with a dimension of 300. We extract the paper titles and abstracts and then train a word vector model [

We compare different methods on these networks to demonstrate the effectiveness of our community structure matrix. The detailed information of the datasets can be seen in Table

The compared methods detect either nonoverlapping or overlapping communities, so we choose different evaluation metrics for the two cases.
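For the nonoverlapping case, the two metrics used in this evaluation, AC and NMI, are commonly computed as in the following sketch. It rests on assumptions: accuracy is taken under the best one-to-one mapping between predicted and true labels (found here by brute force, which is only practical for a handful of communities; the Hungarian algorithm is the usual choice at scale), and NMI is normalized by the geometric mean of the two entropies. The paper's exact normalization may differ.

```python
import math
from collections import Counter
from itertools import permutations


def accuracy(true_labels, pred_labels):
    """Fraction of correct nodes under the best label-to-label mapping."""
    true_ids = sorted(set(true_labels))
    pred_ids = sorted(set(pred_labels))
    # Pad with None so surplus predicted labels can stay unmatched.
    padding = [None] * max(0, len(pred_ids) - len(true_ids))
    best_hits = 0
    for targets in permutations(true_ids + padding, len(pred_ids)):
        mapping = dict(zip(pred_ids, targets))
        hits = sum(mapping[p] == t for t, p in zip(true_labels, pred_labels))
        best_hits = max(best_hits, hits)
    return best_hits / len(true_labels)


def entropy(counts, n):
    return -sum(c / n * math.log(c / n) for c in counts.values())


def nmi(true_labels, pred_labels):
    """Mutual information divided by sqrt(H(true) * H(pred))."""
    n = len(true_labels)
    ct, cp = Counter(true_labels), Counter(pred_labels)
    joint = Counter(zip(true_labels, pred_labels))
    mutual = sum(c / n * math.log(c * n / (ct[t] * cp[p]))
                 for (t, p), c in joint.items())
    h_true, h_pred = entropy(ct, n), entropy(cp, n)
    if h_true == 0 or h_pred == 0:
        # Degenerate case: at least one partition is a single community.
        return 1.0 if h_true == h_pred else 0.0
    return mutual / math.sqrt(h_true * h_pred)


# Toy labelings (assumed): a relabeled perfect result and a noisy one.
truth = [0, 0, 0, 1, 1, 1]
swapped = [1, 1, 1, 0, 0, 0]
noisy = [0, 0, 1, 1, 1, 1]
```

Both metrics are invariant to how the community labels are numbered, so `swapped` scores as a perfect result even though no label matches literally.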

For nonoverlapping communities:

In terms of the measures to evaluate the quality of nonoverlapping communities, we use two evaluation metrics. We adopt the same evaluation procedure used in [

The first metric is the accuracy (AC [

The second metric is the normalized mutual information (NMI [

For overlapping communities:

We compare a set of detected communities

In this section, we perform a parameter sensitivity analysis of CDCN on the Wisconsin and Washington datasets. These two datasets have a moderate number of nodes, which makes the effect of each parameter easier to see. Our algorithm has two hyperparameters:

Figures

Fixing

Fixing

Figures

Fixing

Fixing

We compared our algorithm against six topology-based methods, i.e., SNMF, SLPA, DEMON, CPM, Louvain, and InfoMap; three node-attribute-based methods, i.e., CAN, SMR, and NC; and three methods that consider both network topology and node attributes, i.e., PCL-DC, SCI, and CDE.

As with the experiments in [

For all baseline methods, we set their parameters by default to achieve the best results for those methods. For example, for CDE, we set

In this subsection, we evaluate the results on nonoverlapping communities. We report the ACs and NMIs of all methods in Table

The performances of different community detection algorithms on nonoverlapping communities measured by the AC and NMI.

Metric | Datasets | InfoMap | CPM | SLPA | Louvain | DEMON | SNMF | CAN | SMR | NC | PCL-DC | SCI | CDE | CDCN |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AC | Cornell | 0.200 | 0.446 | 0.239 | 0.266 | 0.377 | 0.371 | 0.446 | 0.415 | 0.317 | 0.348 | 0.354 | 0.449 | 0.569 |
AC | Texas | 0.214 | 0.471 | 0.523 | 0.269 | 0.475 | 0.496 | 0.545 | 0.470 | 0.540 | 0.369 | 0.488 | 0.498 | 0.641 |
AC | Washington | 0.143 | 0.469 | 0.434 | 0.204 | 0.381 | 0.410 | 0.491 | 0.508 | 0.456 | 0.408 | 0.401 | 0.568 | 0.695 |
AC | Wisconsin | 0.152 | 0.471 | 0.262 | 0.223 | 0.430 | 0.386 | 0.471 | 0.471 | 0.422 | 0.354 | 0.396 | 0.645 | 0.694 |
AC | Citeseer | 0.053 | 0.178 | 0.090 | 0.238 | 0.208 | 0.309 | 0.212 | 0.211 | 0.314 | 0.452 | 0.327 | 0.474 | 0.544 |
AC | HEP-TH | 0.041 | 0.123 | 0.091 | 0.135 | 0.113 | 0.168 | 0.135 | 0.162 | 0.143 | 0.193 | 0.184 | 0.201 | 0.215 |
NMI | Cornell | 0.147 | 0.050 | 0.138 | 0.109 | 0.051 | 0.061 | 0.045 | 0.061 | 0.084 | 0.081 | 0.073 | 0.311 | 0.358 |
NMI | Texas | 0.102 | 0.077 | 0.040 | 0.059 | 0.043 | 0.097 | 0.021 | 0.090 | 0.115 | 0.068 | 0.097 | 0.252 | 0.328 |
NMI | Washington | 0.117 | 0.006 | 0.158 | 0.087 | 0.079 | 0.035 | 0.046 | 0.117 | 0.038 | 0.103 | 0.086 | 0.341 | 0.406 |
NMI | Wisconsin | 0.111 | 0.027 | 0.106 | 0.078 | 0.047 | 0.070 | 0.037 | 0.070 | 0.077 | 0.071 | 0.069 | 0.406 | 0.427 |
NMI | Citeseer | 0.214 | 0.199 | 0.214 | 0.228 | 0.166 | 0.096 | 0.003 | 0.007 | 0.003 | 0.221 | 0.083 | 0.208 | 0.263 |
NMI | HEP-TH | 0.035 | 0.092 | 0.084 | 0.112 | 0.093 | 0.137 | 0.114 | 0.123 | 0.114 | 0.152 | 0.145 | 0.179 | 0.191 |

The baseline methods are InfoMap, CPM, SLPA, Louvain, DEMON, SNMF, CAN, SMR, NC, PCL-DC, SCI, and CDE. The real datasets are Cornell, Texas, Washington, Wisconsin, Citeseer, and HEP-TH. The ground-truth communities within each dataset do not share nodes, so they are nonoverlapping communities. We run our method and all baseline methods on these datasets.

Compared with the algorithms that focus only on the original network topology or only on node attributes, the results show that combining the original network topology with the inherent community structure information yields large improvements. For example, among the algorithms that focus on the original network topology or node attributes, the highest ACs on Cornell, Texas, Washington, Wisconsin, Citeseer, and HEP-TH are 0.446, 0.545, 0.508, 0.471, 0.314, and 0.168, respectively, while the ACs of our method are 0.569, 0.641, 0.695, 0.694, 0.544, and 0.215, respectively. The values increase by 0.123, 0.096, 0.187, 0.223, 0.230, and 0.047, respectively. As with the AC, our method also greatly improves the NMI.
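The AC gains quoted above can be re-derived from the reported values with a short check. The AC numbers are copied from the results table for the nonoverlapping datasets; the variable names are illustrative only.

```python
# Best AC among the topology-only and attribute-only baselines, per dataset.
datasets = ["Cornell", "Texas", "Washington", "Wisconsin", "Citeseer", "HEP-TH"]
best_single_source_ac = [0.446, 0.545, 0.508, 0.471, 0.314, 0.168]
# AC reported for CDCN on the same datasets.
cdcn_ac = [0.569, 0.641, 0.695, 0.694, 0.544, 0.215]

# Absolute gain of CDCN, rounded to the three decimals used in the table.
gains = {
    name: round(ours - base, 3)
    for name, base, ours in zip(datasets, best_single_source_ac, cdcn_ac)
}
largest_gain = max(gains, key=gains.get)
```

The largest absolute gain is on Citeseer and the smallest is on HEP-TH, where all methods score low.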

The experimental results can be seen in Table

Some scholars have proposed effective methods in the detection of overlapping communities, such as subspace decomposition, maximal cliques, maximal subgraph, and the clustering coefficient. Li et al. [

For the overlapping communities, we use the F1-Score and Jaccard-Similarity to evaluate the partitioning results of all the methods, except the clustering methods that cannot discover overlapping communities. The tested network is the complete Facebook data, which contains 10 different ego-networks with manually identified circles. We select 4 representative ego-networks from them. The experimental results can be seen in Table

The performances of different community detection algorithms on overlapping communities measured by the F1-Score and Jaccard-Similarity.

Metric | Datasets | InfoMap | SLPA | Louvain | DEMON | SNMF | PCL-DC | SCI | CDE | CDCN |
---|---|---|---|---|---|---|---|---|---|---|
F1-Score | FaceBook ego-network 107 | 0.448 | 0.510 | 0.264 | 0.517 | 0.378 | 0.384 | 0.405 | 0.474 | 0.539 |
F1-Score | FaceBook ego-network 698 | 0.636 | 0.628 | 0.588 | 0.576 | 0.612 | 0.345 | 0.239 | 0.574 | 0.640 |
F1-Score | FaceBook ego-network 1912 | 0.372 | 0.323 | 0.366 | 0.328 | 0.378 | 0.312 | 0.316 | 0.322 | 0.379 |
F1-Score | FaceBook ego-network 3908 | 0.579 | 0.528 | 0.567 | 0.387 | 0.410 | 0.421 | 0.388 | 0.471 | 0.580 |
F1-Score | FaceBook | 0.330 | 0.351 | 0.321 | 0.214 | 0.134 | 0.224 | 0.213 | 0.324 | 0.372 |
Jaccard-Similarity | FaceBook ego-network 107 | 0.372 | 0.410 | 0.205 | 0.421 | 0.267 | 0.304 | 0.294 | 0.369 | 0.432 |
Jaccard-Similarity | FaceBook ego-network 698 | 0.556 | 0.529 | 0.482 | 0.464 | 0.487 | 0.256 | 0.141 | 0.441 | 0.571 |
Jaccard-Similarity | FaceBook ego-network 1912 | 0.286 | 0.272 | 0.251 | 0.229 | 0.249 | 0.200 | 0.202 | 0.215 | 0.290 |
Jaccard-Similarity | FaceBook ego-network 3908 | 0.432 | 0.441 | 0.421 | 0.313 | 0.287 | 0.310 | 0.268 | 0.352 | 0.441 |
Jaccard-Similarity | FaceBook | 0.206 | 0.213 | 0.242 | 0.178 | 0.169 | 0.182 | 0.168 | 0.224 | 0.262 |

The baseline methods reported for the overlapping case are InfoMap, SLPA, Louvain, DEMON, SNMF, PCL-DC, SCI, and CDE. The real datasets are FaceBook Ego-network 107, FaceBook Ego-network 698, FaceBook Ego-network 1912, FaceBook Ego-network 3908, and the complete FaceBook network. Circles in these networks can share nodes, so the communities are overlapping. We run our method and all baseline methods on these datasets.
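For reference, F1-Score and Jaccard-Similarity between detected and ground-truth overlapping communities are often computed with a symmetric best-match average. The sketch below assumes that convention, which may differ in detail from the definition used in this paper; the helper names and the toy communities are hypothetical.

```python
def f1(a, b):
    """F1 between two node sets: harmonic mean of precision and recall."""
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(a), overlap / len(b)
    return 2 * precision * recall / (precision + recall)


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def best_match_average(detected, truth, score):
    """Average best-match score, taken in both directions and then halved."""
    def one_side(source, target):
        return sum(max(score(s, t) for t in target) for s in source) / len(source)
    return 0.5 * (one_side(truth, detected) + one_side(detected, truth))


# Toy overlapping communities (assumed): node 4 is in both true circles.
truth = [{1, 2, 3, 4}, {4, 5, 6}]
detected = [{1, 2, 3}, {4, 5, 6, 7}]
f1_score = best_match_average(detected, truth, f1)
jaccard_score = best_match_average(detected, truth, jaccard)
```

Matching in both directions penalizes a method both for missing true communities and for reporting spurious ones.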

Ego-network 107 has the most nodes, Ego-network 698 has the fewest nodes, Ego-network 1912 has the highest intensive degree, and Ego-network 3908 has the lowest intensive degree.

As shown in Table

Some networks have no node attributes, and it is hard to divide them into communities using most methods that rely on node attributes. However, CDCN handles this case easily, since we can use only the community structure part of our method. Our method is then simplified as follows:

The update formulas change accordingly:

To prove the usefulness of our community structure matrix, we add two more baselines, which are called Adj-Mat and Emb-Mat. Adj-Mat just replaces the community structure matrix

Begin

According to the network graph

Update ^{(i+1)} according to Eq.(

Update ^{(i+1)} according to Eq.(

End While

End

The baseline methods are InfoMap, CPM, SLPA, Louvain, DEMON, SNMF, Adj-Mat, and Emb-Mat. We assess the NMI and AC values on the Karate, Football, and Polbooks datasets.

In addition, in this part, we compare our method with the methods that focus on the original network topology and do not use node attribute information on three datasets. The three datasets have nonoverlapping communities, so we use AC and NMI to evaluate the partitioning results of all the methods. The results can be seen in Table

The performances of the different community detection algorithms on nonoverlapping communities measured by the AC and NMI.

Metric | Datasets | InfoMap | CPM | SLPA | Louvain | DEMON | SNMF | Adj-Mat | Emb-Mat | CDCN |
---|---|---|---|---|---|---|---|---|---|---|
NMI | Karate | 0.581 | 0.652 | 0.658 | 0.569 | 0.429 | 0.836 | 0.825 | 0.912 | 1.0 |
NMI | Football | 0.901 | 0.855 | 0.582 | 0.713 | 0.463 | 0.894 | 0.875 | 0.903 | 0.909 |
NMI | Polbooks | 0.412 | 0.538 | 0.498 | 0.547 | 0.383 | 0.508 | 0.487 | 0.524 | 0.570 |
AC | Karate | 0.894 | 0.888 | 0.820 | 0.741 | 0.688 | 0.970 | 0.962 | 0.970 | 1.0 |
AC | Football | 0.918 | 0.897 | 0.613 | 0.716 | 0.546 | 0.891 | 0.867 | 0.911 | 0.923 |
AC | Polbooks | 0.695 | 0.819 | 0.785 | 0.823 | 0.740 | 0.743 | 0.724 | 0.798 | 0.838 |

As shown in Table

Community detection has been widely used in recommendation systems, social networks, and network security. Efficient and fast community detection algorithms contribute to the development of intelligent networks. To solve the problem of community detection in attributed graphs, and based on an analysis of network characteristics, we propose in this paper a novel method, named CDCN, that generates a community structure matrix retaining the relationship between two directly connected nodes or nodes that share the same neighbors. We combine node attribute information and community structure information in an effective way within the nonnegative matrix factorization framework. We evaluate our method with two indicators, AC and NMI, on nonoverlapping communities and with two indicators, F1-Score and Jaccard-Similarity, on overlapping communities. On nonoverlapping communities, the AC and NMI values of CDCN are better than those of the other methods. On overlapping communities, the F1-Score and Jaccard-Similarity values of CDCN are better than those of the other methods. The extensive experimental results demonstrate that our algorithm can effectively discover the communities in real networks.

The original dataset used in this work is available from the corresponding author on request.

The authors declare that they have no conflicts of interest.

This work is supported partly by the National Key Research and Development Program of China under Grant No. 2017YFB1400200 and the Science and Technology Plan in Key Fields of Yunnan under Grant No. 202001BB050076.