
A Parallel Community Structure Mining Method in Big Social Networks

Research Article

Songchang Jin (1,2; http://orcid.org/0000-0002-5959-0768), Philip S. Yu (2), Shudong Li (1), Shuqiang Yang (1), Haipeng Peng

(1) College of Computer, National University of Defense Technology, Changsha, Hunan 410073, China (nudt.edu.cn)

(2) Department of Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA (uic.edu)

Mathematical Problems in Engineering (MPE), ISSN 1563-5147, 1024-123X. Hindawi Publishing Corporation, 2015. Article ID 934301. DOI: 10.1155/2015/934301.

Dates: 05 07 2014, 02 08 2014, 27 4 2015. Copyright 2015.

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Community structure plays a key role in analyzing network features and helps people dig out valuable hidden information. However, discovering hidden community structures is one of the biggest challenges in social network analysis, especially when the network grows very large. Infomap is a top-class algorithm for nonoverlapping community structure detection. However, it is designed for a single processor, so when tackling large networks its scalability is limited and it cannot fully utilize server resources. In this paper, based on infomap, we develop a scalable parallel nonoverlapping community detection method, Pinfomr (parallel Infomap with MapReduce), which utilizes the MapReduce framework to solve these two problems. Experiments on artificial networks and real datasets show that our parallel method has satisfying performance and scalability.

1. Introduction

A few common properties have been discovered in many complex networks: the small-world property, the scale-free feature, and the community structure pattern [1–4]. Community structure plays a key role in the formation and function of these networks [5]. However, detecting it remains a grave challenge in complex systems [6].

Current social networks have grown to millions or even billions of nodes [7]. Facebook, for example, has reached 1.16 billion monthly active users [8]. Because of their computational cost, traditional community discovery algorithms are unable to tackle such huge complex networks. It is therefore necessary to implement a fast and scalable approach to detect communities in big social networks.

Network partitioning is NP-complete [9]. Partitioning a network into approximately equal-sized components while minimizing the number of edges between different components is extremely important in parallel computing [10]. For example, parallelizing many applications involves assigning data or processes evenly to processors while minimizing the communication traffic. However, when the network size reaches a certain level, partitioning the original network directly is not realistic, and traditional algorithms converge too slowly.

Nowadays, mainstream servers are configured with high performance hardware. Empirical studies [11] have shown that infomap [12] is a top-class standalone algorithm for nonoverlapping community detection. However, the processing capability of a single core has hit a technological bottleneck, and the scalability of infomap suffers as a consequence because it uses only one core or processor of the server. In addition, running infomap on a multiprocessor server wastes computing resources. How to improve the scalability of infomap and make full use of servers is a thorny problem.

Information science is shifting from computing-intensive to data-intensive [13] with the advent of the era of big data. Some novel parallel computing frameworks have emerged, among which MapReduce [14] is one of the best. In this paper, based on our previous work [15], we present a new scalable parallel community detection method that combines several existing techniques: infomap, k-shell decomposition, multilevel network partitioning, and MapReduce. A high-level description of our approach is as follows. First, we divide the whole network into a number of partitions, where the number of partitions is far less than the number of communities. To speed up the process, we develop an enhanced multilevel partitioning method. Next, with MapReduce, we mine the community structures within all partitions in parallel. Finally, we collect the community structures together to form the final result.

The main contributions of this paper are as follows: (1) we propose a new model to mine community structure in big social networks; (2) we integrate k-shell decomposition theory with the multilevel k-way partitioning algorithm to deal with peripheral nodes; (3) we implement a scalable, parallel infomap to uncover community structures and to improve the resource utilization rate.

The rest of this paper is organized as follows. Section 2 briefly reviews some concepts and background information. Section 3 provides problem statement and detailed description of the parallel community detection method. In Section 4, we conduct a couple of experiments to evaluate the performance of the method proposed in this paper. Finally, Section 5 provides some concluding remarks and outlines future research directions.

2. Preliminary Knowledge

2.1. Relevant Concepts

In this paper, we only study undirected networks, which can be mathematically described as $G=(V,E)$, consisting of node set $V$ and edge set $E$; $n$ represents the number of nodes, $v_i\in V$ represents a node, and $d(v_i)$ means its degree; $m$ represents the number of edges and $e_{i,j}$ is the edge between $v_i$ and $v_j$, where $0<i\ne j\le n$.

Infomap is based on information theory, so some information-theoretic concepts are briefly reviewed here. In information theory, the information contained in a distribution is called entropy. For a discrete random variable $X=\{x_1,x_2,\ldots,x_n\}$ with a probability distribution $P(X)$, its entropy is

$$H(X)=-\sum_{x\in X}p(x)\log p(x). \tag{1}$$

Mutual information measures the information shared between two distributions, $X$ and $Y$. We define $P(X,Y)$ as the joint probability of $X$ and $Y$. $P_x(X)$ and $P_y(Y)$ are defined as the marginal probability distributions of $X$ and $Y$, respectively. Then, the mutual information of $X$ and $Y$ is

$$I(X;Y)=\sum_{y\in Y}\sum_{x\in X}p(x,y)\log\frac{p(x,y)}{p_x(x)\,p_y(y)}. \tag{2}$$

Normalized mutual information (NMI) is often used for evaluating clustering results, information retrieval, feature selection, and so forth. The value range of NMI is [0,1], and when $X$ and $Y$ are the same, $\mathrm{NMI}(X;Y)$ equals 1.0:

$$\mathrm{NMI}(X;Y)=\frac{I(X;Y)}{\sqrt{H(X)\,H(Y)}}. \tag{3}$$
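As an illustration of (1)–(3), the following Python sketch computes entropy, mutual information, and NMI from two label assignments over the same nodes. It assumes base-2 logarithms, the square-root normalization, and the convention that NMI is 0 when either labeling has zero entropy; the function names and the toy labelings are hypothetical and are not taken from the experiments.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of the empirical label distribution, as in (1)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mutual_information(x, y):
    """Mutual information (bits) of two equal-length label lists, as in (2)."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log2((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )

def nmi(x, y):
    """NMI as in (3); returns 0 when either labeling has zero entropy (assumed convention)."""
    hx, hy = entropy(x), entropy(y)
    if hx == 0 or hy == 0:
        return 0.0
    return mutual_information(x, y) / math.sqrt(hx * hy)

truth = [0, 0, 0, 1, 1, 1]
same = [5, 5, 5, 9, 9, 9]    # identical grouping under different label names
mixed = [0, 1, 0, 1, 0, 1]   # grouping that is nearly unrelated to truth
print(nmi(truth, same), nmi(truth, mixed))
```

Relabeling the communities does not change the score, which is why NMI is convenient for comparing a detected division with a planted one.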

2.2. K-Shell Decomposition Theory

K-shell decomposition is a well-established method for analyzing the structure of large-scale networks [16–18]. In particular, it provides a method for identifying hierarchies in a network. It assumes that the importance of a node is determined by its location in the network rather than by its degree. The process assigns an integer index, ks, to each node, representing its location within the successive layers (k-shells) of the network. The ks index is a robust measure, and the node ranking is not significantly influenced by incomplete information. The k-core of a network G is the maximal subnetwork of G in which every node has degree no less than k. The k-shell of G is the set of all nodes belonging to the k-core of G but not to the (k+1)-core.

Nodes are assigned to k-shells based on their remaining degree, which is obtained by successively pruning nodes whose degree is no larger than the ks value of the current layer. The decomposition process starts by removing all nodes with degree d=1. After that, some nodes may be left with one link. We then prune the system iteratively until there is no node left with d=1 in the network. The removed nodes, along with the corresponding links, form the k-shell with index ks=1. In a similar fashion, we iteratively remove the next k-shell, where ks=2, and continue to remove higher k-shells until all nodes are removed. As a result, each node is associated with one ks index, and the network can be viewed as the collection of all k-shells. The ks value of a node can be very different from its degree. In Figure 2, we can see that v9 has 7 neighbors but ks(v9)=1. Figure 5 shows the network of Figure 2 after the peripheral nodes have been processed.
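The pruning procedure described above can be sketched in a few lines of Python. This is a minimal single-machine illustration, not the implementation used in the experiments; the function name `k_shell` and the toy network (a triangle with a hub that carries four leaves) are hypothetical.

```python
from collections import defaultdict

def k_shell(edges):
    """Return {node: ks} for an undirected edge list by iterative pruning."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:                      # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    degree = {v: len(nb) for v, nb in adj.items()}
    remaining = set(adj)
    ks = {}
    k = 1
    while remaining:
        # nodes whose remaining degree has dropped to the current layer or below
        peel = [v for v in remaining if degree[v] <= k]
        if not peel:
            k += 1                      # current shell is complete, move to the next
            continue
        for v in peel:
            ks[v] = k
            remaining.discard(v)
            for w in adj[v]:
                if w in remaining:
                    degree[w] -= 1      # removing v lowers its neighbors' degree
    return ks

# Triangle 1-2-3; hub 9 hangs off node 1 and carries leaves 10..13.
edges = [(1, 2), (2, 3), (1, 3), (1, 9), (9, 10), (9, 11), (9, 12), (9, 13)]
ks = k_shell(edges)
print(ks[9], ks[1])
```

In this toy network the hub has degree 5 yet ends up in the first shell, because it is left with a single link once its leaves are pruned, which mirrors the remark about v9 in Figure 2.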

2.3. Multilevel k-Way Partitioning Method

Partitioning the node set V of a network G into k disjoint subsets {V1,V2,…,Vk} is called a k-way partitioning of V. Each subset and the edges within the subset constitute a partition of G. Figure 1 shows a simple network with 5 communities surrounded by the dotted circles and 3 partitions.

Figure 1

A network with 5 communities and 3 partitions.

Figure 2

Schematic representation of k-shell decomposition.

Definition 1 (effective edge lost ratio).

An edge whose endpoints are in the same community, that is, an intracommunity edge, is called an effective edge. If the endpoints of an effective edge are assigned to different partitions, we call it an effective edge lost. The effective edge lost ratio is the number of effective edges lost divided by the total number of edges in the network.

In Figure 1, e3,4 is an effective edge lost, whereas e1,2 is not. Effective edges lost evidently affect community detection more than edges that connect nodes in different communities and are cut off by partition boundaries.
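To make Definition 1 concrete, the sketch below counts total and effective edges lost for a given community assignment and partition assignment. The toy network, the labels, and the function name `edge_loss_ratios` are hypothetical; they do not reproduce Figure 1.

```python
def edge_loss_ratios(edges, community, partition):
    """Return (total edge lost ratio, effective edge lost ratio).

    community and partition map each node to its community id and partition id.
    """
    total_lost = 0
    effective_lost = 0
    for u, v in edges:
        if partition[u] != partition[v]:          # edge is cut by a partition boundary
            total_lost += 1
            if community[u] == community[v]:      # and it was an intracommunity edge
                effective_lost += 1
    m = len(edges)
    return total_lost / m, effective_lost / m

# Two triangles (communities A and B) joined by the edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)]
community = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
partition = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 2}
total, effective = edge_loss_ratios(edges, community, partition)
print(total, effective)
```

Here three of the seven edges are cut, but only the two that lie inside community B count as effective edges lost; the cut edge (3, 4) joins different communities and is therefore not effective.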

A number of high-quality and computationally efficient graph partitioning methods have been proposed, and multilevel graph partitioning algorithms [9, 19, 20] are currently considered state-of-the-art and are used extensively. Here, we optimize the multilevel k-way partitioning method proposed by Abou-Rjeili and Karypis [21] to partition power law networks.

As Figure 3 shows, the multilevel k-way partitioning method consists of a coarsening phase, an initial partitioning phase, and an uncoarsening and refinement phase. Instead of partitioning the original graph G0 directly, multilevel partitioning algorithms first obtain, in the coarsening phase, a sequence of successive approximations of the original graph, such as G1, G2, and G3. Each of these approximations is smaller than the original graph. This process continues until the approximation contains only a few hundred nodes. At this point, a partitioning algorithm computes partitions of that graph, corresponding to the five small partitions of G3 in the initial partitioning phase; since the graph is quite small, even simple algorithms such as K-L [22] get reasonably good results. A parameter is used to control the balance of the partitions. The final step, the uncoarsening and refinement phase, maps the partitions of the smallest graph G3 back onto the original graph G0 to derive the final partitions.

Figure 3

A schematic diagram of multilevel k-way partitioning method.

2.4. Infomap

In this paper, we continue our work on the information-theoretic community detection model, infomap. First, we briefly review the model. It utilizes the duality between compressing a data set and detecting and extracting significant patterns or structures within the data, which is a statistical concept known as minimum description length statistics [23]. A random walk, represented as a Markov process, is used as a proxy for information flow. For a community-structured network, when a random walker enters a community, it tends to stay in it for a long time, and the probability of moving out into another community is low.

In an undirected network, the random walker has a state x(t)=vi∈V at time t, indicating where it is. In the next step, t+1, the walker moves to a node vj chosen at random from the neighbors of vi. To describe the state of the random walker, a 2-level description model with Huffman coding is proposed. The first level encodes the communities and the second level encodes the nodes within the communities. We can then use “community ID + node ID” to uniquely describe a particular node in the network. Huffman codes are a prefix-free coding scheme and are optimally efficient for symbol-by-symbol encoding. They save space by assigning short codewords to common events or objects and long codewords to rare ones, just as Morse code uses short codes for common letters and longer codes for rare ones. So, the path of the random walker can be described as a coding sequence.

Figure 4 illustrates the 2-level description method. Assuming that the network is divided into 2 communities as in Figure 4(b), the code sequence for the random walk in Figure 4(a) is “0 111 00 10 111 010 10 011 110 00 10 110 1011 1 00 01 10 00 11 10 01 1.” The first codeword, “0,” means that the random walk starts from C1. The codeword pair “1011 1” means that the random walk leaves C1 and enters C2. The description length of the sequence is 50 bits, or about 2.63 bits per step. With the division in Figure 4(c), we need 57 bits, or 3.0 bits per step. The community division in Figure 4(b) is obviously more reasonable than that in Figure 4(c), and its average description length is shorter. From the perspective of information theory, smaller entropy corresponds to smaller uncertainty. For community detection, smaller entropy means less indistinctness, that is, a clearer community structure.

Figure 4

Random walk and 2-level Huffman coding on a network with two communities.

(a) (b) (c)

Figure 5

The result of Figure 2 without peripheral nodes. Integers indicate node weights.

3. Parallel Community Detection Method

3.1. Problem Statement

Assume that there is an optimal community division, $M^*$, of a community-structured network $G$. With $M^*$, the network $G$ is divided into $num^*$ communities, and the lower limit of the average description code length is $L(M^*)$. According to the Shannon source coding theorem [24] and Kraft's inequality [25], for any division pattern $M$, the average codeword length per source symbol, $L(M)$, for an optimal prefix-free code satisfies

$$H(X)\le L(M^*)\le L(M)\le H(X)+1. \tag{4}$$

Obviously, calculating an endless random walk on a network to get $L(M)$ is not realistic. Fortunately, when walking randomly on a network endlessly, we get a steady visit frequency for each node, and we can calculate it easily with many methods, such as PageRank. With the steady visit frequency distribution, we are able to describe the state of the random walker easily. For $x(t+1)\in\{\mathrm{neighbor}(x(t))\}$, the probability of $x(t+1)$ and $x(t)$ being in the same community is $p_{within}$ and the probability of being in different communities is $q_{k,out}$, where $x(t+1)$ belongs to community $k$. Then $L(M)$ can be described as

$$L(M)=q_{out}H(Q)+\sum_{i=1}^{num}p_{within}^{i}H(P_i), \tag{5}$$

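where $q_{out}$ is the probability of moving out of the current community and $q_{out}=\sum_{i=1}^{num}q_{i,out}$. $H(Q)$ is the average codeword length of the community-level code, which is used when the walker moves between communities, and it can be expressed as

$$H(Q)=\sum_{i=1}^{num}H_i(Q)=-\sum_{i=1}^{num}\frac{q_{i,out}}{\sum_{j=1}^{num}q_{j,out}}\log\frac{q_{i,out}}{\sum_{j=1}^{num}q_{j,out}}. \tag{6}$$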

With the probability $p_a$ (or $p(a)$) of visiting the node $v_a$, $p_{within}^{i}$ represents the probability of staying in the current community during the next step, and $p_{within}^{i}=\sum_{v_a\in C_i}p_a+q_{i,out}$. $H(P_i)$ expresses the information entropy of the visiting probability of the nodes in the community $C_i$, which can be written as

$$H(P_i)=-\frac{q_{i,out}}{q_{i,out}+\sum_{v_b\in C_i}p_b}\log\frac{q_{i,out}}{q_{i,out}+\sum_{v_b\in C_i}p_b}-\sum_{v_a\in C_i}\frac{p_a}{q_{i,out}+\sum_{v_b\in C_i}p_b}\log\frac{p_a}{q_{i,out}+\sum_{v_b\in C_i}p_b}. \tag{7}$$
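The following Python sketch evaluates (5)–(7) for a small undirected, unweighted network. It relies on two assumptions that hold for a random walk without teleportation on such a network: the steady visit probability of a node is its degree divided by 2m, and the exit probability of a community is the number of its cut edge endpoints divided by 2m. Logarithms are taken base 2, and the function name `map_equation` and the toy graph are hypothetical.

```python
import math
from collections import defaultdict

def plogp(p):
    """p * log2(p), with the convention 0 * log 0 = 0."""
    return p * math.log2(p) if p > 0 else 0.0

def map_equation(edges, community):
    """Description length L(M) in bits per step for a node -> community mapping."""
    two_m = 2 * len(edges)
    p = defaultdict(float)     # steady visit probability p_a = d_a / 2m (assumption)
    q = defaultdict(float)     # exit probability q_{i,out} of each community
    for u, v in edges:
        p[u] += 1 / two_m
        p[v] += 1 / two_m
        if community[u] != community[v]:
            q[community[u]] += 1 / two_m
            q[community[v]] += 1 / two_m
    q_out = sum(q.values())
    length = 0.0
    if q_out > 0:              # first term of (5): q_out * H(Q), with H(Q) from (6)
        length -= q_out * sum(plogp(qi / q_out) for qi in q.values())
    members = defaultdict(list)
    for node, pa in p.items():
        members[community[node]].append(pa)
    for c, probs in members.items():
        p_within = q[c] + sum(probs)
        h = -plogp(q[c] / p_within) - sum(plogp(pa / p_within) for pa in probs)  # (7)
        length += p_within * h
    return length

# Two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
one = map_equation(edges, {v: 0 for v in range(6)})
two = map_equation(edges, {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1})
print(one, two)
```

On this toy graph the division into the two triangles needs about 2.32 bits per step, against about 2.56 bits when all nodes share one community, so the shorter description corresponds to the more natural division.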

Because the problem is NP-complete, we cannot obtain the globally optimal division pattern $M^*$ by direct computation on a big social network. However, we can achieve a set of local optima that approximates $M^*$ by partitioning the network into small subnetworks (partitions) and tackling them independently with MapReduce. The issue then becomes how to discover the optimal division pattern $M_i^*$ in partition $P_i$ and obtain the final $M^*=\bigcup_{i=1}^{k}\{M_i^*\}$. For $P_i$, it is sufficient to calculate $L(M_i)$ for different $M_i$ and pick the one with the shortest description length as $M_i^*$. Finally, we get a community set $C=\{C_1,C_2,\ldots,C_k\}$, where $C_i=\{c_{i,1},c_{i,2},\ldots,c_{i,k_i}\}$ corresponds to $P_i$, $\bigcup_{i=1}^{k}C_i=V$, and $C_i\cap C_j=\emptyset$ for $0<i\ne j\le k$, with $|P|\ll|C|$.

3.2. Procedure of the Parallel Community Detection

For convenience of illustration, we use Figure 1 throughout this section and assume that the number of partitions is far less than the number of communities in big social networks. There are 3 stages in the parallel community detection process.

In the first stage, we calculate the steady visiting probability of all nodes (shown in Algorithm 1). Here, we modify the traditional PageRank, which is designed for directed networks, and run an iterative MapReduce-based version to get the global steady visit probability vector. In each iteration, the visit probability of $v_a$ is (since there are no teleports or link sinks in undirected networks, we set $\tau=0$)

$$p_a=\tau\times\frac{1}{n}+(1-\tau)\times\sum_{v_i\in\mathrm{neighbor}(v_a)}\frac{p_i}{d_i}. \tag{8}$$

Algorithm 1: Steady visiting probability vector calculation on MapReduce (VPC).

method Map(nid n, node v)
    p ← pv / |v.adjacencyList|          // share sent to each neighbor, see (8)
    emit(nid n, v)                      // pass along network structure
    for all nodeid m ∈ v.adjacencyList do
        emit(nid m, p)                  // pass pagerank contribution to neighbors
    endfor

method Reduce(nid m, [p1, p2, …])
    v ← ∅
    s ← 0
    for all p ∈ [p1, p2, …] do
        if isNode(p) then
            v ← p                       // recover local network structure
        else
            s ← s + p                   // sum pagerank contributions
        endif
    endfor
    pv ← s
    emit(nid m, v)
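Algorithm 1 can be tried out without a Hadoop cluster by simulating the map, shuffle, and reduce steps in memory. The Python sketch below does that for τ=0; the function names, the dictionary-based node records, and the toy network are hypothetical, and the check against degree/2m assumes a connected network that is not bipartite, so that the iteration converges.

```python
from collections import defaultdict

def vpc_map(nid, node):
    """Emit the node record, then an equal share of its probability to each neighbor."""
    yield nid, ("node", node)
    share = node["p"] / len(node["adj"])
    for m in node["adj"]:
        yield m, ("p", share)

def vpc_reduce(nid, values):
    """Recover the node record and sum the incoming contributions (tau = 0)."""
    node, s = None, 0.0
    for kind, value in values:
        if kind == "node":
            node = value
        else:
            s += value
    return nid, {"adj": node["adj"], "p": s}

def vpc_iteration(graph):
    """One map -> shuffle -> reduce round, simulated in memory."""
    shuffled = defaultdict(list)
    for nid, node in graph.items():
        for key, value in vpc_map(nid, node):
            shuffled[key].append(value)
    return dict(vpc_reduce(nid, values) for nid, values in shuffled.items())

def steady_visit_probability(adj, iterations=100):
    """Start from the uniform distribution and iterate a fixed number of rounds."""
    n = len(adj)
    graph = {v: {"adj": list(nb), "p": 1.0 / n} for v, nb in adj.items()}
    for _ in range(iterations):
        graph = vpc_iteration(graph)
    return {v: node["p"] for v, node in graph.items()}

# Two triangles joined by the edge (2, 3); 7 edges, so 2m = 14.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
p = steady_visit_probability(adj)
print(p)
```

For this toy network the iteration settles at the degree of each node divided by 2m, which is the steady visit probability used by the description length formulas.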

Second, we use the multilevel k-way partitioning method, enhanced by k-shell decomposition, to divide a big social network into k approximately equal-sized disjoint partitions (P1, P2, and P3 in Figure 1). Edges cut off by partition boundaries are discarded. Because the networks studied here are sparse and have community structure, the edge cut (lost) ratio will be low. However, the partitioning method has a decisive influence on the effectiveness of the final community detection, as the experiments in Section 4 show. A matching of a network is a set of edges no two of which share a node. A commonly used way to coarsen a network is to collapse the node pairs of a matching, obtained for example by random matching, heavy-edge matching, or maximum weighted matching. However, this shrinks the network slowly and does not consider the relative importance of nodes in different locations. Power law networks contain a large number of low-degree nodes with low ks values, and we can turn this characteristic to our advantage. Here, we use k-shell decomposition to merge the peripheral nodes quickly and more accurately during the coarsening phase (shown in Algorithm 2).

Algorithm 2: Amalgamation of peripheral nodes in coarsening phase.

set int id[n+1] = {0, 1, 2, …, n}
while true do
    tmpV.clear()                        // reset per-sweep state
    for all node vi ∈ V do
        flag[i] = false
        if d[i] == 1 then
            tmpV.add(vi)                // a list to store nodes with ks=1
        endif
    endfor
    if tmpV.isEmpty() then
        break                           // process finished
    endif
    for all node vj ∈ tmpV do
        if flag[j] != true then         // vj has not been annexed
            k = getID(vj.getDiffNeighbor())    // the neighbor other than vj itself
            id[j] = k                   // assign new id to vj, annexed by vk
            wk = wk + 1
            updateNeighbor(vk, j)       // replace vj with vk in vk's neighbor list
            d[k] = d[k] - 1
            d[j] = 0
            flag[j] = flag[k] = true
        endif
    endfor
endwhile
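The idea behind Algorithm 2 can be sketched in Python as follows. This is a simplified variant rather than a line-by-line transcription: it keeps the network as a dictionary of neighbor sets, and it adds the accumulated weight of an absorbed node to its neighbor (an assumption, so that the node weights always sum to the original node count). The function name `merge_peripheral_nodes` and the toy network are hypothetical.

```python
def merge_peripheral_nodes(adj):
    """Repeatedly absorb degree-1 nodes into their only neighbor.

    adj maps each node to the set of its neighbors (undirected network).
    Returns (core adjacency, node weights of the core, representative of every node).
    """
    adj = {v: set(nb) for v, nb in adj.items()}    # work on a copy
    weight = {v: 1 for v in adj}
    merged_into = {v: v for v in adj}
    while True:
        leaves = [v for v in adj if len(adj[v]) == 1]
        if not leaves:
            break                                   # no peripheral node is left
        for j in leaves:
            if j not in adj or len(adj[j]) != 1:
                continue                            # degree changed during this sweep
            (k,) = adj[j]                           # the only neighbor of j
            weight[k] += weight[j]                  # k absorbs j and everything j absorbed
            adj[k].discard(j)
            del adj[j]
            merged_into[j] = k

    def representative(v):
        while merged_into[v] != v:                  # follow the chain of absorptions
            v = merged_into[v]
        return v

    core_weight = {v: weight[v] for v in adj}
    return adj, core_weight, {v: representative(v) for v in merged_into}

# Triangle 1-2-3; hub 9 hangs off node 1 and carries leaves 10..13.
adj = {1: {2, 3, 9}, 2: {1, 3}, 3: {1, 2}, 9: {1, 10, 11, 12, 13},
       10: {9}, 11: {9}, 12: {9}, 13: {9}}
core, core_weight, rep = merge_peripheral_nodes(adj)
print(core, core_weight)
```

In this example the four leaves are absorbed by the hub, the hub is then absorbed by node 1, and the coarsened network is the triangle with node 1 carrying weight 6.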

In the last stage, the parallel community detection method is carried out on all partitions (such as the 3 partitions in Figure 1), and all partitions are tackled independently. When the parallel process finishes, each partition has generated a community set (such as the 2 communities in P1 in Figure 1). Combining all the community sets gives the final result. Figure 6 illustrates this process, and more detail is shown in Algorithm 3. At the beginning of this stage, we treat each node as a community and then use a bottom-up greedy scheme to (1) find community pairs whose merge minimizes ΔL(M) and (2) merge them to form new communities.

Algorithm 3: The map procedure of community detection on a subnetwork Gt.

Initialization of global variables:
    nc = |Vt|                           // number of communities
    L0 = 0
    for i from 1 to |Vt| do
        cid[i] = i                      // community ID of vi
        L0 = L0 + H(vi)
    endfor

method Map(node v, adjacencyList)
    δ = -1                              // force the first pass
    while δ < 0 do
        for i ∈ [1, |Vt|] do
            lock[i] = false             // able to merge in current iteration
        endfor
        for i ∈ [1, |Vt|] do
            j = cid[random(|Vt|) + 1]   // randomly select a node (community)
            curNeighbor[] = getAdjacencyList(cj)
            δ = 0                       // decrease of L
            maxCID = 0                  // community id which leads to minimum δ
            for cs in curNeighbor[] do
                if Ls,j - L0 < δ then   // if merge cs and cj
                    maxCID = s
                    δ = Ls,j - L0
                endif
            endfor
            if maxCID ≠ 0 and δ ≠ 0 then
                lock[maxCID] = lock[j] = true
                nc = nc - 1
                L0 = L0 + δ
                cid[maxCID] = j
            endif
        endfor
    endwhile
    for i ∈ [1, nc] do
        emit(i, getAdjacencyList(ci))
    endfor
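The bottom-up greedy scheme can be illustrated on a single small partition with the Python sketch below. It is a simplification of Algorithm 3, not a reproduction of it: instead of picking communities at random, it evaluates every pair of adjacent communities in each round and applies the single merge that shortens the description length most, recomputing the length from scratch. It also assumes an undirected, unweighted network with visit probabilities equal to degree/2m. All function names and the toy graph are hypothetical.

```python
import math
from collections import defaultdict

def plogp(p):
    """p * log2(p), with the convention 0 * log 0 = 0."""
    return p * math.log2(p) if p > 0 else 0.0

def description_length(edges, cid):
    """Two-level description length (bits per step) for a node -> community mapping."""
    two_m = 2 * len(edges)
    node_p = defaultdict(float)    # visit probability of each node (degree / 2m)
    q = defaultdict(float)         # exit probability of each community
    for u, v in edges:
        for a, b in ((u, v), (v, u)):
            node_p[a] += 1 / two_m
            if cid[a] != cid[b]:
                q[cid[a]] += 1 / two_m
    q_out = sum(q.values())
    length = 0.0
    if q_out > 0:
        length -= q_out * sum(plogp(qi / q_out) for qi in q.values())
    by_comm = defaultdict(list)
    for a, pa in node_p.items():
        by_comm[cid[a]].append(pa)
    for c, probs in by_comm.items():
        p_within = q[c] + sum(probs)
        length -= p_within * (plogp(q[c] / p_within)
                              + sum(plogp(pa / p_within) for pa in probs))
    return length

def greedy_merge(edges):
    """Start from singleton communities and apply the best merge until none helps."""
    nodes = sorted({v for e in edges for v in e})
    cid = {v: v for v in nodes}
    best_len = description_length(edges, cid)
    while True:
        # candidate merges: pairs of communities joined by at least one edge
        pairs = {tuple(sorted((cid[u], cid[v]))) for u, v in edges if cid[u] != cid[v]}
        best_pair = None
        for a, b in pairs:
            trial = {v: (a if c == b else c) for v, c in cid.items()}
            length = description_length(edges, trial)
            if length < best_len:
                best_len, best_pair = length, (a, b)
        if best_pair is None:
            return cid, best_len            # no merge shortens the description
        a, b = best_pair
        cid = {v: (a if c == b else c) for v, c in cid.items()}

edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
cid, final_len = greedy_merge(edges)
print(cid, final_len)
```

Recomputing the description length for every trial merge keeps the sketch short but is far too slow for real partitions; a practical implementation would update only the terms of the two communities involved.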

Figure 6

A schematic diagram of MapReduce process for community detection.

4. Experiments and Analysis

In this section, we conduct several experiments and analyze the results. All experiments were run on the Hadoop-1.1.1 cluster of Antivision Software Ltd. The cluster consists of 20 PowerEdge R320 servers (Intel Xeon CPU E5-1410 @2.80 GHz, memory 8 GB) with 64-bit NeoKylin Linux OS, and the servers are connected by a Cisco 3750G-48TS-S switch. The data sets are shown in Table 1, including artificial networks and real networks.

Table 1

Data sets used in experiments (the increment of mix is set to 0.05; M = 10^6).

Data set	n	m	avg(d)	γ	β	Size(C)	mix
LiveJournal	3,997,962	34,681,189	/	/	/	/	/
Youtube	1,134,890	2,987,624	/	/	/	/	/
Orkut	3,072,441	117,185,083	/	/	/	/	/
D0	0.1 M	/	45	2.5	1.5	[25, 120]	[0.1, 0.75]
D1	0.2 M	/	40	2.5	1.5	[25, 100]	[0.1, 0.75]
D2	0.4 M	/	40	2.5	1.5	[25, 100]	[0.1, 0.75]
D3	80,000	/	45	2.5	1.5	[25, 120]	[0.1, 0.75]
D4	0.2 M, 0.4 M, 0.8 M, 1.6 M	/	45	2.5	1.5	[25, 100]	0.45
D5	3.2 M, 6.4 M, 10 M	/	45	2.5	1.5	[25, 100]	0.45

All artificial networks used here are generated by the LFR benchmark. In LFR, several parameters give direct control over network properties: network size (n), degree distribution (γ, dmax, avg(d)), and community structure (β, mix) [26]. γ and β are the exponents of the degree and community size distributions, which range over [2,3] and [1,2], respectively. Mix is the ratio of the number of edges connecting nodes from different communities to the number of all edges. As midrange values, we set γ=2.5 and β=1.5 for the artificial networks.

4.1. Accuracy Experiments

In the accuracy experiments, we compare our method, Pinfomr, with two top-class methods, the Louvain algorithm [27] and the OSLOM algorithm [28], on different data sets and with different partition numbers. The data sets used are D0, D1, and D2, and the results are shown in Figure 7, where |P| means the partition number. We define the situation when |C|=1 or |C|=n as no community structure and set NMI to 0 in this case; the case when |C| is close to n is discarded. Take Louvain in Figure 7(a), for instance: when mix=0.75, |C|/n=0.373, so a community contains only about 3 nodes on average. From the design of LFR, we know that when the mix value is too high, such as higher than 0.75, there are no obvious community structures, and the network is no longer a power law network but more like a random network, which is not the focus here. Figure 7 indicates that Pinfomr is more stable and accurate than the others in uncovering community structures in power law networks. As for running time, for the same data set Louvain consumes the longest time and Pinfomr the shortest. OSLOM requires a little more time than Pinfomr when the mix parameter is not too big.

Figure 7

Accuracy and running time tests on different data sets.

(a)

Dataset D1 and P=4

(b)

Dataset D2 and P=8

(c)

Dataset D0 with different P

From Figure 7(c), we see that NMI decreases as the number of partitions increases, but the performance is excellent and stable when mix is less than 0.75, with NMI staying at about 1.0. Our results show that Pinfomr is able to achieve better results in a shorter period of time, at the cost of some accuracy as the number of partitions grows.

4.2. Partitioning Experiment

In the previous section, we mentioned that the quality of the partitioning plays a vital role in the final performance of parallel community detection. Therefore, we conduct an experiment in this section to test the impact and effectiveness of different partitioning methods on Pinfomr.

We compare the improved multilevel k-way partitioning method with two simple partitioning methods. One is a sequential partitioning method that divides the network according to the storage order of the nodes and edges on the HDFS. The other is a random matching partitioning method that randomly chooses nodes to generate a matching. For example, assume that we bisect G=(V,E) with n=20,000 and m=300,000 into G1=(V1,E1) and G2=(V2,E2). (1) With the sequential partitioning method, the first 10,000 nodes form V1 and the links within V1 form E1; the other nodes are left for V2 and the links within V2 form E2. (2) With the random matching method, we randomly select 5,000 node pairs into V1 and all links within V1 form E1; the process for G2 is similar. Dividing a connected network into subnetworks or partitions causes edge loss. Excellent partitioning methods always try to cut through the gaps between communities and avoid cutting off effective edges. Here, we use the data set D1 to test the performance of different partitioning methods with |P|=2 and |P|=4. In Figures 8(a) and 8(b), we can see that the multilevel k-way partitioning method is stable, and the results of Pinfomr with it are very close to the results of infomap and to the real results. From Figures 8(c) and 8(d), we see that, for the multilevel k-way partitioning method, the total edge loss ratio increases linearly with the mix parameter. As the meaning of the mix parameter suggests, the effective edge loss ratio remains low until mix rises to 0.70. The behavior of the sequential and random partitioning methods is also easy to explain. The edges of LFR-generated networks are distributed randomly and uniformly, regardless of storage order. As a result, the total edge loss ratio remains at about (|P|-1)/|P|. The effective edge loss decreases linearly as mix increases, for the same reason that the total edge loss ratio of the multilevel partitioning method increases linearly with mix in Figures 8(a) and 8(b).
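The value (|P|-1)/|P| can be checked with a short calculation, under the simplifying assumption (introduced here for illustration, not stated in the experiments) that each endpoint of an edge lands independently and uniformly at random in one of |P| equal-sized partitions:

$$\Pr[e_{i,j}\ \text{is cut}]=1-\sum_{t=1}^{|P|}\frac{1}{|P|}\cdot\frac{1}{|P|}=\frac{|P|-1}{|P|},$$

which gives 0.5 for |P|=2 and 0.75 for |P|=4. If, in addition, a fraction of roughly 1-mix of all edges is intracommunity, the expected effective edge loss ratio is about (1-mix)(|P|-1)/|P|, which decreases linearly as mix grows.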

Figure 8

Performance and edge loss ratio of different partition methods on data set D3.

(a)

P = 2

(b)

P = 4

(c)

P = 2; T and E mean “total edge lost” and “effective edge lost,” respectively

(d)

P = 4

In addition, we conduct a degree distribution test on a real network, LiveJournal, to verify the performance of the improved multilevel partitioning method. The network is divided into 4 partitions by the improved multilevel k-way partitioning method, and the degree distributions of the original network and the 4 subnetworks are shown in Figure 9. The comparison indicates that the subnetworks obtained by the improved multilevel k-way partitioning method maintain the degree distribution characteristics of the original network.

Figure 9

Degree distribution test on LiveJournal data set.

(a)

Degree distribution of LiveJournal

(b)

Degree distribution of subnetworks of LiveJournal

4.3. Scalability and Performance Experiment

Our study aims to uncover community structures in big social networks and to improve resource utilization as much as possible. Here, we address the two problems together by means of MapReduce, achieving the goal at the expense of a small loss of performance. In this section, we test the scalability and performance of the parallel community detection method; the data sets used are D4, D5, LiveJournal (http://snap.stanford.edu/data/com-LiveJournal.html), Youtube (http://snap.stanford.edu/data/as-skitter.html), and Orkut (http://snap.stanford.edu/data/com-Orkut.html).

For a given network in Figures 10(a) and 10(b), the speedup ratio increases with |P|, but the gain slows down, since MapReduce needs some time to initialize before map tasks start to run and to transmit data from the map phase to the reduce phase. For a given |P|, the speedup ratio becomes higher as the network size grows. For Figure 10(c), we present only the running time of the parallel community detection method on MapReduce, because a single server (|P|=1) cannot handle such large networks.

Figure 10

Scalability tests on data sets D4 and D5.

(a)

Speedup ratio test on D4

(b)

Effective edge lost ratio test on D4

(c)

Running time test on D5

Finally, we apply the same process to the real networks. The experiments on real networks shown in Figure 11 also confirm that our parallel community detection method has excellent scalability. From the results, we can conclude that, when |P| increases, the subnetwork assigned to each map task becomes smaller and the total edge lost ratio increases, which further reduces the subnetwork size. From Figures 10 and 11, we observe the following: for a constant data size, the running time is approximately linear in |P| when |P| is small, so the number of partitions affects the running time significantly. When the partition number |P| reaches a “critical point” (Figure 11(c), Orkut, |P|=72, and Figure 11(b), Youtube, |P|=20), the running time is less affected by changes in the partition number and shows a “long tail effect” to some extent. The reason is that the cost of MapReduce is basically fixed. For a larger social network with the same number of map tasks, MapReduce's initialization time accounts for a smaller proportion of the total running time. When the partition number increases and the total running time decreases, the proportion of the initialization time is no longer negligible. This makes our method exhibit the “long tail effect” on different data sets.
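The “long tail effect” can be illustrated with a simple cost model. The model is an assumption made here for illustration and is not fitted to the measurements: $T_0$ stands for the fixed MapReduce start-up and data transfer cost, and $W$ for the total detection work, which is taken to divide evenly among the |P| map tasks (the extra shrinkage caused by lost edges is ignored).

$$T(|P|)\approx T_0+\frac{W}{|P|},\qquad S(|P|)=\frac{T(1)}{T(|P|)}=\frac{T_0+W}{T_0+W/|P|}\le 1+\frac{W}{T_0}.$$

Under this model, the running time falls quickly while $W/|P|\gg T_0$ and flattens once $W/|P|$ becomes comparable to $T_0$; a larger network (larger $W$) moves that point to a larger |P|, which is in line with the different critical points observed for Youtube and Orkut.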

Figure 11

Scalability tests on real networks.

(a)

Scalability test on LiveJournal

(b)

Scalability test on Youtube

(c)

Scalability test on Orkut

5. Conclusion

Community detection has become an important research topic in social networks. Traditional community mining algorithms cannot effectively adapt to current big social network scenarios [29, 30]. Infomap is an excellent standalone community detection method and, by means of the multilevel k-way partitioning method enhanced by k-shell decomposition, we are able to develop a parallel community discovery method on the MapReduce framework. The experiments verify the validity of the proposed work, which may serve as a reference for social network analysis and social community mining with big data techniques. Next, we will try to use overlapping partitioning methods to further improve the community detection accuracy.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to express their sincere gratitude to Zhang Yuchao from Beijing Institute of System Engineering for providing great assistance through the entire research process, Lancichinetti A. from Amaral Lab of Northwestern University for supporting their work unselfishly with implementation of some algorithms, and Chen Siming from University of Illinois at Chicago for his careful review, comments, and feedback on this paper. In addition, this research is supported by the National High-Tech R&D Program of China (nos. 2012AA012600, 2012AA01A401, and 2012AA01A402), the National Natural Science Foundation of China (no. 61202362), the State Key Development Program of Basic Research of China (no. 2013CB329601), and Project funded by the China Postdoctoral Science Foundation (no. 2013M542560).

References

[1] Girvan; Newman, M. E. Community structure in social and biological networks. Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 12, pp. 7821–7826, 2002. doi:10.1073/pnas.122653799. Scopus: 2-s2.0-0037062448.

[2] Lee; Cunningham. Community detection: effective evaluation on large social networks. Journal of Complex Networks, vol. 2, no. 1, pp. 19–37, 2013. doi:10.1093/comnet/cnt012.

[3] Watts, D. J.; Strogatz, S. H. Collective dynamics of “small-world” networks. Nature, vol. 393, no. 6684, pp. 440–442, 1998. doi:10.1038/30918. Scopus: 2-s2.0-0032482432.

[4] Boccaletti; Latora; Moreno; Chavez; Hwang, D.-U. Complex networks: structure and dynamics. Physics Reports, vol. 424, no. 4-5, pp. 175–308, 2006. doi:10.1016/j.physrep.2005.10.009. Scopus: 2-s2.0-31344474880.

[5] Newman, M. E. J. The structure and function of complex networks. SIAM Review, vol. 45, no. 2, pp. 167–256, 2003. doi:10.1137/S003614450342480. MR2010377. Scopus: 2-s2.0-0038718854.

[6] Newman, M. E. J.; Leicht, E. A. Mixture models and exploratory analysis in networks. Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 23, pp. 9564–9569, 2007. doi:10.1073/pnas.0610537104. ZBL1155.91026. Scopus: 2-s2.0-34547405111.

[7] Chen, G. R.; Wang, X. F. Introduction to Complex Networks: Models, Structures and Dynamics. Higher Education Press, Beijing, China, 2012.

[8] Cooper. The Largest Social Networks in the World Include Some Big Surprises. Business Insider, January 2014. http://www.businessinsider.com/the-largest-social-networks-in-the-world-2013-12

[9] Bui, T. N.; Jones. A heuristic for reducing fill-in in sparse matrix factorization. In Proceedings of the 6th SIAM Conference on Parallel Processing for Scientific Computing (PPSC '93), pp. 445–452, 1993.

[10] Andreev; Räcke. Balanced graph partitioning. Theory of Computing Systems, vol. 39, no. 6, pp. 929–939, 2006. doi:10.1007/s00224-006-1350-7. Scopus: 2-s2.0-33845324514.

[11] Lancichinetti; Fortunato. Community detection algorithms: a comparative analysis. Physical Review E, vol. 80, no. 5, Article ID 056117, 2009. doi:10.1103/PhysRevE.80.056117. Scopus: 2-s2.0-71849108522.

[12] Rosvall; Bergstrom, C. T. Multilevel compression of random walks on networks reveals hierarchical organization in large integrated systems. PLoS ONE, vol. 6, no. 4, Article ID e18209, 2011. doi:10.1371/journal.pone.0018209. Scopus: 2-s2.0-79954508889.

[13] Ding. Advancement of operating system to manage critical resources in increasingly complex computer architecture. Electronic thesis or dissertation, 2010. https://etd.ohiolink.edu/

[14] Dean; Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008. doi:10.1145/1327452.1327492. Scopus: 2-s2.0-37549003336.

[15] Jin; Yang; Lin; Deng. A MapReduce and information compression based social community structure mining method. In Proceedings of the IEEE 16th International Conference on Computational Science and Engineering (CSE '13), pp. 971–980, IEEE, Sydney, Australia, December 2013. doi:10.1109/CSE.2013.143.

[16] Kitsak; Gallos, L. K.; Havlin; Liljeros; Muchnik; Stanley, H. E.; Makse, H. A. Identification of influential spreaders in complex networks. Nature Physics, vol. 6, no. 11, pp. 888–893, 2010. doi:10.1038/nphys1746. Scopus: 2-s2.0-78149283685.

[17] Zhao; Peng; Luo; Yang. Multiple routes transmitted epidemics on multiplex networks. Physics Letters A, vol. 378, no. 10, pp. 770–776, 2014.

[18] Zhao; Peng; Yang. An efficient patch dissemination strategy for mobile networks. Mathematical Problems in Engineering, vol. 2013, Article ID 896187, 13 pages, 2013. doi:10.1155/2013/896187.

[19] Karypis; Kumar. Analysis of multilevel graph partitioning. In Proceedings of the ACM/IEEE Conference on Supercomputing, ACM, 1995.

[20] Alderson; Doyle, J. C.; Willinger. Towards a theory of scale-free graphs: definition, properties, and implications. Internet Mathematics, vol. 2, no. 4, pp. 431–523, 2005.

[21] Abou-Rjeili; Karypis. Multilevel algorithms for partitioning power-law graphs. In Proceedings of the Parallel and Distributed Processing Symposium, 2006.

[22] Kernighan, B. W.; Lin. An efficient heuristic procedure for partitioning graphs. The Bell System Technical Journal, vol. 49, no. 2, pp. 291–307, 1970.

[23] Grünwald, P. D.; Myung, I. J.; Pitt, M. A. Advances in Minimum Description Length: Theory and Applications. MIT Press, New York, NY, USA, 2005.

[24] Shannon, C. E. A mathematical theory of communication. ACM SIGMOBILE Mobile Computing and Communications Review, vol. 5, no. 1, pp. 3–55, 2001.

[25] Huffman, D. A. A method for the construction of minimum-redundancy codes. Proceedings of the IRE, vol. 40, no. 9, pp. 1098–1101, 1952.

[26] Orman, G. K.; Labatut; Cherifi. Towards realistic artificial benchmark for community detection algorithms evaluation. International Journal of Web Based Communities, vol. 9, no. 3, pp. 349–370, 2013. doi:10.1504/IJWBC.2013.054908. Scopus: 2-s2.0-84880087214.

[27] Lancichinetti; Radicchi; Ramasco, J. J.; Fortunato. Finding statistically significant communities in networks. PLoS ONE, vol. 6, no. 4, Article ID e18961, 2011. doi:10.1371/journal.pone.0018961. Scopus: 2-s2.0-79955707583.

[28] Blondel, V. D.; Guillaume; Lambiotte; Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, vol. 2008, no. 10, Article ID P10008, 2008. doi:10.1088/1742-5468/2008/10/P10008. Scopus: 2-s2.0-56349094785.

[29] Fractal time series - a tutorial review. Mathematical Problems in Engineering, vol. 2010, Article ID 157264, 26 pages, 2010. doi:10.1155/2010/157264. MR2570932. Scopus: 2-s2.0-77951489276.

[30] Chen; Han. Correlation matching method for the weak stationarity test of LRD traffic. Telecommunication Systems, vol. 43, no. 3-4, pp. 181–195, 2010. doi:10.1007/s11235-009-9206-5. Scopus: 2-s2.0-77951023690.