
Identifying community structures and the underlying semantic characteristics of communities is an essential task in complex network analysis. However, most methods proposed so far apply only to assortative community structures, that is, more links within communities and fewer links between different communities, and thus ignore the rich diversity of community regularities in real networks. In addition, node attributes, which provide rich semantic information about communities and networks, can complement structural information and facilitate in-depth community detection. In this paper, we propose a novel unified Bayesian generative model that detects generalized communities and provides semantic descriptions simultaneously by combining network topology and node attributes. The proposed model is composed of two parts closely correlated through a transition matrix: we first apply the concept of a mixture model to describe network regularities and then adapt the classic Latent Dirichlet Allocation (LDA) topic model to identify the semantics of communities. Thus, the model can detect broad types of network structure regularities, including assortative structures, disassortative structures, and mixture structures, and provide multiple semantic descriptions for the communities. To optimize the objective function of the model, we use an effective Gibbs sampling algorithm. Experiments on a number of synthetic and real networks show that our model outperforms several baselines on community detection.

With the advent of the era of big data and the diverse channels for acquiring data, we have obtained a large amount of data from complex systems in the real world [

At present, exploring the structural regularities and functions of the network is a significant part of complex network analysis [

However, most conventional community detection methods only consider the network structure but ignore the attributes of nodes. In fact, the attributes of nodes help to improve the performance of community detection, because nodes with similar attributes tend to belong to the same community [

Most methods that have been proposed for community detection are typically only appropriate for assortative community structure; i.e., the nodes within a community are densely connected [

In particular, although node attributes may carry essential semantic information of communities, there are few ways to detect generalized communities, that is, detecting broad types of network structural regularities and combining network structures and attributes. Chen et al. [

As a result, considering the rich diversity of community regularities in real networks, node attributes can not only improve the quality of generalized community detection but also help identify the latent semantic characteristics of communities. Identifying generalized communities and providing their semantic descriptions are therefore both worth studying in complex network analysis, yet none of the above methods solves this twofold problem. Instead, we propose a unified generative model that detects communities in a wide variety of network structures without any prior knowledge of the type of intrinsic regularities in the networks. At the same time, we derive semantic descriptions of the communities by combining the network structure and attributes. Our model is composed of two parts closely related through a probability transition matrix. The first is the topology part, in which communities are described based on a mixture model, assuming that nodes in the same group have similar link patterns (no matter whether there are more links within the communities or between communities). The second is the attribute part, in which semantic information is identified by the classic topic model (LDA) [

In summary, the contributions of this paper are as follows:

To the best of our knowledge, this is the first work to propose the generalized community in attributed networks, in which the nodes of a community share similar link patterns and semantic similarity

We propose a unified generative model to analyze attributed networks and detect the generalized community structure together with its semantic description; the model can describe the internal relationship between the topological structure and the node attributes of the network

We also develop an effective Gibbs sampling algorithm, and experiments show that it performs better than several baselines

To explore the structural regularities of networks, several methods for detecting generalized communities have been proposed. Recently, node attributes have also attracted extensive attention in complex network analysis.

Newman and Leicht [

There are several methods for content analysis, such as Latent Dirichlet Allocation (LDA) [

In this section, we give a formal description of the proposed model, i.e., Generalized Semantic Community (GSC) identification, with the purpose of generalized community detection and semantic identification in the networks.

We define an attributed network

Observed quantities: the number of groups

Latent quantities: group labels

Model parameters:

Table

Notations.

Type | Symbol | Description |
---|---|---|
Observed quantities | | Adjacency matrix |
Observed quantities | | Attribute matrix |
Observed quantities | | Number of communities and topics |
Latent quantities | | Community assignment of node |
Latent quantities | | Topic labels of the node |
Model parameters | | The fraction of nodes in community |
Model parameters | | The probability that a certain node in community |
Model parameters | | The probability that node |
Model parameters | | The probability that the |
Hyperparameters | | Acting as noninformative priors of the corresponding model parameters with prior distribution |

Given the rich diversity of community regularities in real networks, encoding network structure and node attributes simultaneously and providing semantic descriptions of the resulting communities remain open problems in community detection. Most existing methods address only some aspects of these problems. Given an attributed network, the goal is twofold:

How can the nodes be divided into communities and content clusters, no matter what kind of structural regularity the network exhibits?

How can the correlations between communities and attribute topics be identified to provide the best semantic descriptions of communities?

So the problem can be formalized as, given the adjacency matrix

To achieve the objective, we define a unified Bayesian probabilistic generative model to handle topologies and node attributes at the same time. Our goal is to divide the nodes in networks with extensive structural regularities into

Sample

The graphical representation of the model.

For each community

Sample

Sample

For each topic

Sample

For each new node

Sample a latent group assignment

For each node

Sample edge

For each of the

Sample

Sample attribute
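The generative steps listed above can be sketched in code. The following is a minimal illustration only, not the authors' implementation: because the symbols of the model were lost in this copy, the names `omega` (community proportions), `theta` (per-community link pattern), `psi` (community-to-topic transition matrix), and `phi` (per-topic attribute distribution) are assumptions, as are the symmetric Dirichlet priors with all concentration parameters equal to 1, the fixed number of links and attribute tokens per node, and the fact that self-loops are allowed.

```python
import numpy as np


def generate_attributed_network(n_nodes=60, n_comms=3, n_topics=4, n_attrs=20,
                                edges_per_node=5, attrs_per_node=8, seed=0):
    """Sample a toy attributed network from a GSC-style generative process.

    All names and default sizes are illustrative assumptions.
    Returns (A, X, z): link counts, attribute counts, community labels.
    """
    rng = np.random.default_rng(seed)

    # Community proportions (assumed symmetric Dirichlet prior).
    omega = rng.dirichlet(np.ones(n_comms))
    # For each community: a distribution over link targets ...
    theta = rng.dirichlet(np.ones(n_nodes), size=n_comms)
    # ... and a row of the community-to-topic transition matrix.
    psi = rng.dirichlet(np.ones(n_topics), size=n_comms)
    # For each topic: a distribution over attributes.
    phi = rng.dirichlet(np.ones(n_attrs), size=n_topics)

    # Latent community assignment of every node.
    z = rng.choice(n_comms, size=n_nodes, p=omega)

    A = np.zeros((n_nodes, n_nodes), dtype=int)
    X = np.zeros((n_nodes, n_attrs), dtype=int)
    for i in range(n_nodes):
        # Link targets follow the link pattern of node i's community,
        # so the structure may be assortative, disassortative, or mixed.
        targets = rng.choice(n_nodes, size=edges_per_node, p=theta[z[i]])
        np.add.at(A[i], targets, 1)
        # Each attribute token first picks a topic through the transition
        # matrix, then an attribute from that topic.
        topics = rng.choice(n_topics, size=attrs_per_node, p=psi[z[i]])
        for k in topics:
            X[i, rng.choice(n_attrs, p=phi[k])] += 1
    return A, X, z
```

Nodes in the same community share one row of `theta`, which is what lets a mixture model of this kind capture link patterns without assuming dense intra-community links.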

We introduce a Bayesian treatment into the model generation process. After the number of communities

We use Dirichlet distributions to generate the following model parameters, respectively:

At first, we sample the latent community membership

After the latent community membership

As

We generate attributes

Then, the probability of the network

It is subject to

To exactly infer that the latent variables

Because the Dirichlet and Multinomial distributions are conjugate, equation (
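As a reminder of why conjugacy helps, the standard Dirichlet-multinomial identity is sketched below in generic notation. The symbols $\theta$, $\alpha$, and the counts $n_k$ are assumptions chosen for illustration and are not the paper's own notation, which is not recoverable from this copy.

```latex
\int \prod_{k=1}^{K} \theta_k^{\,n_k}\,
     \frac{\Gamma\!\left(\sum_{k}\alpha_k\right)}{\prod_{k}\Gamma(\alpha_k)}
     \prod_{k=1}^{K} \theta_k^{\,\alpha_k-1}\, d\theta
  \;=\;
  \frac{\Gamma\!\left(\sum_{k}\alpha_k\right)}{\Gamma\!\left(\sum_{k}(\alpha_k+n_k)\right)}
  \prod_{k=1}^{K}\frac{\Gamma(\alpha_k+n_k)}{\Gamma(\alpha_k)},
\qquad
p(x_{\text{new}} = k \mid n, \alpha)
  \;=\;
  \frac{n_k+\alpha_k}{\sum_{j=1}^{K}(n_j+\alpha_j)} .
```

The first equality integrates the multinomial parameter out in closed form; the second is the resulting predictive probability, which is the kind of count ratio a collapsed Gibbs sampler evaluates for each candidate assignment.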

The inference process is in Algorithm

0: initialize

Initialize each node’s latent community label

(1)//sampling

(2)for

(3)

(4) //get the current community assignment of node

(5) update

(6)

(7) compute probability

(8)

(9) Gibbs sampling for

(10) update

(11)

(12) //get the current topic assignment of attribute

(13) update

(14)

(15) compute probability

(16)

(17) Gibbs sampling for

(18) update

(19)

(20)

(21) slice sampling for

(22)

For each node

For node

Our model can also handle networks that contain only edges or only node attributes.

The probability of only considering the links can be written as

The community probability of node

The probability of only considering the attributes can be written as

The community probability of node

The topic probability of the attribute

First, we experiment on three synthetic networks with different structural regularities (i.e., assortative, disassortative, and mixture structures) to evaluate the quality of community detection and to analyze the advantage of modeling networks with a rich diversity of structures. Then, we assess the interpretability of communities in an online music system. Finally, we evaluate the model on real networks and compare it with state-of-the-art methods.

As the ground truth of communities in the networks is known, we use the following Normalized Mutual Information (NMI) [
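For readers who want to reproduce the metric, the following is a minimal sketch of NMI between a ground-truth partition and a detected partition. The exact formula used in the paper is missing from this copy, so the normalization here (mutual information divided by the arithmetic mean of the two entropies) is an assumption, and the function name is hypothetical.

```python
import math
from collections import Counter


def normalized_mutual_information(labels_true, labels_pred):
    """NMI = 2 * I(T; P) / (H(T) + H(P)), using natural logarithms.

    The normalization is an assumed convention, not confirmed by the source.
    """
    n = len(labels_true)
    if n == 0 or n != len(labels_pred):
        raise ValueError("label lists must be non-empty and of equal length")

    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    count_joint = Counter(zip(labels_true, labels_pred))

    # Mutual information from the contingency counts.
    mi = 0.0
    for (a, b), n_ab in count_joint.items():
        mi += (n_ab / n) * math.log(n_ab * n / (count_true[a] * count_pred[b]))

    # Entropy of each partition.
    h_true = -sum((c / n) * math.log(c / n) for c in count_true.values())
    h_pred = -sum((c / n) * math.log(c / n) for c in count_pred.values())

    if h_true + h_pred == 0.0:
        # Both partitions put every node in one group: treat as identical.
        return 1.0
    return 2.0 * mi / (h_true + h_pred)
```

The score is invariant to relabeling of the groups, so the detected community identifiers do not need to match the ground-truth identifiers.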

To describe parameter estimation in GSC more adequately, we describe the changing trend of likelihood function with the number of iterations in Figure

(a) Trend of the log-likelihood probability of Cora with iterations. (b) Trend of the log-likelihood probability of Cora with

First, we conduct experiments on synthetic networks to evaluate the quality of community detection. Then, we evaluate the model on real networks and compare it with state-of-the-art methods.

The first synthetic network is a random network in Newman’s method [

We set

(a) The node attributes’ matrix. (b) The community attributes’ matrix.

The value of NMI of three methods on random networks. (a) _{out}. (b) _{out}.

The second synthetic network is Newman’s model [

In particular, each community has a unique signature set of keystones, and only the link pattern to keystones can identify the community; thus the structure of this network is neither assortative nor disassortative.

At first, we study the influence of noise attributes on community detection.

The node attributes’ matrix with different

Communities detected by GSC and NEMBP models on two synthetic networks. (a) The real community assignment of the synthetic network of 108 nodes. (b) The result of community detection by GSC-link. (c) The result of community detection by GSC,

The value of NMI of four methods: (a) on Synthetic 108 with the change of

In this network, the propensity to link to the unique set of keystone nodes determines the group membership. We change the network structure by varying the number of keystone links of each group from 100 to 10 in steps of 10. We set the probability of noisy attributes

The third network [

In this part, we evaluate the efficiency of community detection methods by measuring each method's running time on synthetic networks as we increase the network size. The comparison methods are NEMBP and SCI. The synthetic networks include assortative and disassortative structures. A fixed number of edges is placed uniformly at random within and between communities. For some communities, the number of edges within each community is set to 1,200 and the number of edges between a community and the others is set to 600; these communities form a community structure. The rest of the communities are divided into pairs: the number of edges between the two communities in each pair is set to 2,400, and the number of edges between communities in different pairs is set to 1,200. Each pair of groups forms a bipartite structure. The largest synthetic network has 7,000 nodes, 126,000 edges, and 700 attributes. We vary the scale of the network (Syn-100, Syn-500, Syn-1000, Syn-2000, Syn-3000, Syn-5000, and Syn-7000). Syn-100 is the third network that we used above. For each synthetic network, we generate

Figure

The running time on synthetic networks: (a) GSC on Synthetic networks with the change of nodes from 100 to 7000; (b) NEMBP and SCI on Synthetic networks with the change of nodes from 100 to 2000.

In this paper, we use

The examples show the word clouds of the main attributes of communities. The size of each word indicates the probability that it belongs to a topic.

The first example in Figure

Cora, Citeseer, Terrorist, and Biology are the four real networks with both links and contents that we use in this paper. Cora is a part of the Cora citation network, including 2,708 published articles and 5,429 edges. Each publication is represented by a 1,433-dimensional binary word vector indicating the absence or presence of the corresponding words. The publications are divided into seven communities. Citeseer is a subset of the Citeseer citation network. It includes 3,312 published articles and 4,732 edges. Each publication is represented by a 3,703-dimensional binary word vector. The publications are divided into six communities. The Terrorist dataset consists of 1,293 terrorist attacks; each attack is assigned one of 6 labels indicating the type of the attack. Each attack is described by a 106-dimensional binary word vector whose entries indicate the absence or presence of a feature. Biology is a real paper citation network drawn from 435 different biological journals. It contains 10,000 papers connected by links. Each paper is described by a 9,944-dimensional binary keyword vector; two papers are connected if one cites the other. There are 435 nodes representing different biological journals in the network; each paper links to the node of the journal in which it is published. Thus, the network forms a mixture structure similar to that of the synthetic network of 108 nodes. All the papers are split into 435 groups; each group contains the papers published in a certain journal. We also use Syn-2000, which includes both community and bipartite structures. The five networks are shown in Table

Statistical characteristics of five networks.

Datasets | Nodes | Edges | Attributes | Communities | Type |
---|---|---|---|---|---|
Cora | 2708 | 5429 | 1433 | 7 | Community |
Citeseer | 3312 | 4732 | 3703 | 6 | Community |
Terrorist | 1293 | 3172 | 106 | 6 | Community |
Biology | 10000 | 22662 | 9944 | 435 | Mixture |
Syn-2000 | 2000 | 36000 | 200 | 20 | Mixture |

We compare our GSC model with the methods from three categories: (1) models based on only network structures, that is, GSC-link; (2) models based on only network attributes, such as GSC-attr and LDA; (3) models based on both structures and attributes, such as PCL-DC, NMMA, SCI, and NEMBP.

The results of these models on the five networks are shown in Table

NMI (%) of different models on five networks with node attributes.

Model | Cora | Citeseer | Terrorist | Biology | Syn-2000 | Type |
---|---|---|---|---|---|---|

GSC-link | 16.33 | 4.696 | 1.67 | 26.43 | 95.86 | Link |

GSC-attr | 28.86 | 24.31 | 30.03 | 4.28 | 90.26 | Attr |

LDA | 14.61 | 9.13 | 5.42 | 89.60 | Attr | |

PCL-DC | 17.54 | 2.99 | 5.32 | 3.29 | 88.32 | Link + attr |

NMMA | 41.57 | 25.59 | 6.86 | 94.28 | Link + attr | |

SCI | 19.26 | 4.87 | 8.73 | N/A | 81.57 | Link + attr |

NEMBP | 44.08 | 24.27 | 9.37 | N/A | 78.68 | Link + attr |

GSC | 25.13 | 30.45 | Link + attr |

In this paper, we propose a novel Bayesian probabilistic model that detects generalized communities and identifies their semantics by combining network structure and node attributes, and we use an efficient Gibbs sampling algorithm to optimize the objective function. Even if the node attributes are of poor quality, our method can use the complementary structural information to obtain better results. The model assumes that the network structure and node attributes have different hidden variables and adopts a transition matrix to explore the hidden correlation between communities and topics. Thus, it can provide semantic descriptions of communities that better reveal their characteristics. We evaluate our method on a number of real and synthetic datasets and in a case study. The new method can detect various types of network structures and outperforms several state-of-the-art algorithms.

Like previously proposed methods, our model requires the number of communities to be provided in advance. This is a model selection problem, and in future work we will focus on determining the number of groups automatically.

The datasets used to support the results of this study are available from the corresponding author upon request.

The authors declare that there are no conflicts of interest regarding the publication of this paper.

This work was supported by the National Natural Science Foundation of China (61902278), the National Key R&D Program of China (2018YFC0832101), the Livelihood Science and Technology Project of Qingdao (18-6-1-106-nsh), and the National Social Science Foundation of China (15BGL035).