Community structure plays a key role in analyzing network features and helps uncover valuable hidden information. However, discovering hidden community structures is one of the biggest challenges in social network analysis, especially as networks grow very large. Infomap is a top-class algorithm for nonoverlapping community detection, but it is designed for a single processor; on large networks, its limited scalability prevents it from fully utilizing server resources. In this paper, based on Infomap, we develop a scalable parallel nonoverlapping community detection method, Pinfomr (parallel Infomap with MapReduce), which uses the MapReduce framework to address both problems: detecting communities in large networks and making full use of server resources. Experiments on artificial networks and real data sets show that our parallel method achieves satisfactory performance and scalability.
A few common properties in many complex networks have been discovered: small-world property, scale-free feature, and community structure pattern [
Current social networks have jumped to millions even billions of nodes [
Network partitioning is NP-complete [
Nowadays, mainstream servers are configured with high performance hardware. Empirical studies [
Information science is shifting from computing-intensive to data-intensive [
Main contributions of this paper are as follows:
The rest of this paper is organized as follows. Section
In this paper, we only study undirected networks, which can be mathematically described as
Infomap is based on information theory, so some information-theoretic concepts are briefly reviewed here. In information theory, the information contained in a distribution is called entropy. For a discrete random variable
Nodes are assigned to
Partitioning the node set
A network with 5 communities and 3 partitions.
Schematic representation of
An edge whose endpoints are in the same community, that is, an intracommunity edge, is called an effective edge. If the endpoints of an effective edge are assigned to different partitions, we say that the effective edge is lost. The effective edge lost ratio is the number of lost effective edges divided by the total number of edges in the network, expressed as a percentage.
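The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function name `effective_edge_lost_ratio` and the dictionary-based `community` and `partition` mappings are assumptions made for the example, and the ratio is returned as a fraction rather than a percentage.

```python
from typing import Hashable, Iterable, Mapping, Tuple

Edge = Tuple[Hashable, Hashable]


def effective_edge_lost_ratio(
    edges: Iterable[Edge],
    community: Mapping[Hashable, Hashable],
    partition: Mapping[Hashable, Hashable],
) -> float:
    """Fraction of all edges that are intracommunity but cut by the partitioning."""
    edge_list = list(edges)
    if not edge_list:
        return 0.0
    lost = sum(
        1
        for u, v in edge_list
        # effective edge: both endpoints in the same community
        # lost: the two endpoints were placed in different partitions
        if community[u] == community[v] and partition[u] != partition[v]
    )
    return lost / len(edge_list)


# Hypothetical toy network: two communities ("a", "b") split into two partitions.
edges = [(1, 2), (2, 3), (3, 4), (4, 5)]
community = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b"}
partition = {1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
ratio = effective_edge_lost_ratio(edges, community, partition)
```

In this toy network only the edge (2, 3) is an effective edge whose endpoints fall in different partitions, so one of the four edges is lost.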
In Figure
A number of high-quality and computationally efficient graph partitioning methods have been proposed and multilevel graph partitioning algorithms [
From Figure
A schematic diagram of multilevel
In this paper, we continue our work on the information-theoretic community detection model, Infomap. First, we briefly review the model. It exploits the duality between compressing a data set and detecting significant patterns or structures within the data, a statistical concept known as minimum description length statistics [
In an undirected network, the random walk has a state
Figure
Random walk and 2-level Huffman coding on a network with two communities.
The result of Figure
Assuming there is an optimal community division,
Obviously, calculating an endless random walk on a network to get
where
With the probability
For the NP-complete challenge, we cannot achieve the global optimal division pattern
For the convenience of illustration, we adopt Figure
In the first stage, we calculate the steady visiting probability of all nodes (shown in Algorithm
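As a rough illustration of the first stage, the sketch below computes stationary visiting probabilities in a map/reduce style. It assumes an unweighted, undirected network, for which the stationary distribution of a random walk is proportional to node degree (each node's degree divided by twice the number of edges). The function names are hypothetical, and the paper's own algorithm may organize this computation differently.

```python
from collections import defaultdict


def map_edge(edge):
    """Map step: each undirected edge contributes one unit of degree to both endpoints."""
    u, v = edge
    yield u, 1
    yield v, 1


def reduce_degrees(pairs):
    """Reduce step: sum the emitted counts per node to obtain node degrees."""
    degree = defaultdict(int)
    for node, count in pairs:
        degree[node] += count
    return dict(degree)


def steady_visit_probability(edges):
    """Stationary visit probability of each node: degree / (2 * number of edges)."""
    pairs = (kv for edge in edges for kv in map_edge(edge))
    degree = reduce_degrees(pairs)
    total = sum(degree.values())  # equals twice the number of edges
    return {node: k / total for node, k in degree.items()}


# Hypothetical toy network: a triangle (1, 2, 3) with a pendant node 4.
edges = [(1, 2), (2, 3), (3, 1), (3, 4)]
p = steady_visit_probability(edges)
```

Here node 3 has degree 3 out of a total degree of 8, so it receives the largest share of visits, while the pendant node 4 receives the smallest.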
Second, we use multilevel
In the last stage, the parallel community detection method is carried out on all partitions (such as the 3 partitions in Figure
A schematic diagram of MapReduce process for community detection.
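The overall shape of such a MapReduce process can be sketched as follows. This is only an illustration under stated assumptions, not the paper's algorithm: all function names are hypothetical, edges that cross partitions are simply dropped in the map phase, and the per-partition detector is a connected-components stand-in rather than Infomap, passed in through the `detect` parameter so that a real detector could be substituted.

```python
from collections import defaultdict


def map_phase(edges, partition):
    """Emit (partition id, edge) for every edge whose endpoints share a partition."""
    for u, v in edges:
        if partition[u] == partition[v]:
            yield partition[u], (u, v)


def shuffle(pairs):
    """Group the emitted edges by partition id, as the framework's shuffle would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def connected_components(edges):
    """Stand-in local detector (NOT Infomap): label each node by its component root."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
    return {node: find(node) for node in list(parent)}


def reduce_phase(grouped, detect=connected_components):
    """Run the local detector on each partition; prefix labels with the partition id."""
    result = {}
    for pid, sub_edges in grouped.items():
        for node, label in detect(sub_edges).items():
            result[node] = (pid, label)
    return result


# Hypothetical toy network: a path of six nodes split into two partitions.
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
partition = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}
communities = reduce_phase(shuffle(map_phase(edges, partition)))
```

Prefixing each label with the partition id keeps communities found in different partitions distinct. A node whose edges all cross partition boundaries receives no label in this sketch, which is one reason the quality of the partitioning matters.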
In this section, we conduct several experiments and analyze the results. All experiments were run on the Hadoop-1.1.1 cluster of Antivision Software Ltd. The cluster consists of 20 PowerEdge R320 servers (Intel Xeon CPU E5-1410 @ 2.80 GHz, 8 GB memory) running 64-bit NeoKylin Linux, connected by a Cisco 3750G-48TS-S switch. Data sets are shown in Table
Data sets used in experiments (increment of mix is set to 0.05).
Data set | Nodes | Edges | avg | | | Size | mix |
---|---|---|---|---|---|---|---|
LiveJournal | 3,997,962 | 34,681,189 | / | / | / | / | / |
Youtube | 1,134,890 | 2,987,624 | / | / | / | / | / |
Orkut | 3,072,441 | 117,185,083 | / | / | / | / | / |
D0 | 0.1 M | / | 45 | 2.5 | 1.5 | | |
D1 | 0.2 M | / | 40 | 2.5 | 1.5 | | |
D2 | 0.4 M | / | 40 | 2.5 | 1.5 | | |
D3 | 80,000 | / | 45 | 2.5 | 1.5 | | |
D4 | 0.2 M, 0.4 M, 0.8 M, 1.6 M | / | 45 | 2.5 | 1.5 | | 0.45 |
D5 | 3.2 M, 6.4 M, 10 M | / | 45 | 2.5 | 1.5 | | 0.45 |
All artificial networks used here are generated by LFR benchmark. In LFR, some parameters give us a direct control on network properties: network size (
In accuracy experiments, we compare our method, Pinfomr, with two top-class methods, Louvain algorithm [
Accuracy and running time tests on different data sets.
Dataset D1 and
Dataset D2 and
Dataset D0 with different
From Figure
In the previous section, we mentioned that the quality of the partition plays a vital role in the final performance of parallel community detection. Therefore, in this section we conduct an experiment to test the impact and effectiveness of different partitioning methods on Pinfomr.
We use two simple partitioning methods to compare with the improved multilevel
Performance and edge loss ratio of different partition methods on data set D3.
In addition, we conduct a degree distribution test on a real network, LiveJournal, to verify the performance of the improved multilevel partitioning method. The network is divided into
Degree distribution test on LiveJournal data set.
Degree distribution of LiveJournal
Degree distribution of subnetworks of LiveJournal
Our study aims to uncover community structures in big social networks while improving resource utilization as much as possible. Here, we address the two problems together by means of MapReduce, achieving both goals at a small cost in performance. In this section, we test the scalability and performance of the parallel community detection method; the data sets used are D4, D5, LiveJournal (
For a certain network in Figures
Scalability tests on data sets D4 and D5.
Speedup ratio test on D4
Effective edge lost ratio test on D4
Running time test on D5
Finally, we apply the same process to the real networks. Experiments on real networks shown in Figure
Scalability tests on real networks.
Scalability test on LiveJournal
Scalability test on Youtube
Scalability test on Orkut
Community detection has become an important research topic in social networks. Traditional algorithms on community mining cannot effectively adapt to the current big social network scenarios [
The authors declare that there is no conflict of interests regarding the publication of this paper.
The authors would like to express their sincere gratitude to Zhang Yuchao from Beijing Institute of System Engineering for providing great assistance through the entire research process, Lancichinetti A. from Amaral Lab of Northwestern University for supporting their work unselfishly with implementation of some algorithms, and Chen Siming from University of Illinois at Chicago for his careful review, comments, and feedback on this paper. In addition, this research is supported by the National High-Tech R&D Program of China (nos. 2012AA012600, 2012AA01A401, and 2012AA01A402), the National Natural Science Foundation of China (no. 61202362), the State Key Development Program of Basic Research of China (no. 2013CB329601), and Project funded by the China Postdoctoral Science Foundation (no. 2013M542560).