A Parallel Community Structure Mining Method in Big Social Networks

Community structure plays a key role in analyzing network features and helps people dig out valuable hidden information. However, discovering hidden community structures is one of the biggest challenges in social network analysis, especially when the network grows very large. Infomap is a top-class algorithm for nonoverlapping community detection. However, it is designed for a single processor, so when tackling large networks its limited scalability prevents it from fully utilizing server resources. In this paper, based on Infomap, we develop a scalable parallel nonoverlapping community detection method, Pinfomr (parallel Infomap with MapReduce), which utilizes the MapReduce framework to address both problems: limited scalability and poor resource utilization. Experiments on artificial networks and real datasets show that our parallel method has satisfying performance and scalability.


Introduction
A few common properties of many complex networks have been discovered: the small-world property, the scale-free feature, and the community structure pattern [1][2][3][4]. Community structure plays a key role in the formation and function of these networks [5]. However, detecting it remains a grave challenge in complex systems [6].
Current social networks have grown to millions or even billions of nodes [7]. Facebook, for example, has reached 1.16 billion monthly active users [8]. However, due to computational costs, traditional community discovery algorithms are unable to tackle such huge complex networks. It is therefore necessary to develop a fast and scalable approach to detect communities in big social networks.
Network partitioning is NP-complete [9]. Partitioning a network into approximately equal-sized components while minimizing the number of edges between different components is extremely important in parallel computing [10]. For example, parallelizing many applications involves assigning data or processes evenly to processors while minimizing the communication traffic. However, when the network size reaches a certain level, direct partitioning of the original network is not realistic, and traditional algorithms converge too slowly.
Nowadays, mainstream servers are configured with high-performance multiprocessor hardware. Empirical studies [11] have shown that Infomap [12] is a top-class standalone algorithm for nonoverlapping community detection. However, the processing capability of a single core has hit a bottleneck, and the scalability of Infomap suffers as a consequence because it utilizes only one core or processor of the server. Besides, running Infomap on a multiprocessor server wastes the remaining computing resources. How to improve the scalability of Infomap and make full use of servers is a difficult problem.
Information science is shifting from computing-intensive to data-intensive [13] with the advent of the era of big data. Several novel parallel computing frameworks have emerged, among which MapReduce [14] is one of the best. In this paper, based on our previous work [15], we present a new scalable parallel community detection method that combines several existing techniques: Infomap, k-shell decomposition, multilevel network partitioning, and MapReduce. A high-level description of our approach is as follows. First, we divide the whole network into a number of partitions, where the number of partitions is far less than the number of communities. To speed up this process, we develop an enhanced multilevel partitioning method. Next, with MapReduce, we mine the community structures within all partitions in parallel. Finally, we collect the community structures together to form the final result.
The main contributions of this paper are as follows: (1) We propose a new model to mine community structure in big social networks. (2) We integrate k-shell decomposition theory with the multilevel k-way partitioning algorithm to deal with peripheral nodes. (3) We implement a scalable and parallel Infomap to uncover community structures and to improve the resource utilization rate.
The rest of this paper is organized as follows. Section 2 briefly reviews some concepts and background information. Section 3 provides the problem statement and a detailed description of the parallel community detection method. In Section 4, we conduct several experiments to evaluate the performance of the proposed method. Finally, Section 5 provides some concluding remarks and outlines future research directions.

Relevant Concepts.
In this paper, we only study undirected networks, which can be mathematically described as $G = (V, E)$, consisting of node set $V$ and edge set $E$; $n$ represents the number of nodes, $v_i \in V$ represents a node, and $d(v_i)$ means its degree; $m$ represents the number of edges and $e_{i,j}$ is the edge between $v_i$ and $v_j$, where $0 < i \neq j \le n$.
Infomap is based on information theory, so some information-theoretic concepts are briefly reviewed here. In information theory, the information contained in a distribution is called entropy. For a discrete random variable $X = \{x_1, x_2, \ldots, x_N\}$ with a probability distribution $p(x)$, its entropy is

$$H(X) = -\sum_{i=1}^{N} p(x_i) \log p(x_i).$$

Mutual information measures the information shared between two distributions, $X$ and $Y$. We define $p(x, y)$ as the joint probability of $X$ and $Y$. $p_X(x)$ and $p_Y(y)$ are defined as the marginal probability distributions of $X$ and $Y$, respectively. Then, the mutual information of $X$ and $Y$ is

$$I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p_X(x)\, p_Y(y)}.$$

Normalized mutual information (NMI) is often used for evaluating clustering results, information retrieval, feature selection, and so forth. The value range of NMI is [0, 1], and when $X$ and $Y$ are the same, $\mathrm{NMI}(X; Y)$ equals 1.0:

$$\mathrm{NMI}(X; Y) = \frac{I(X; Y)}{\sqrt{H(X)\, H(Y)}}.$$
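The entropy, mutual information, and NMI definitions can be illustrated with a minimal Python sketch. It assumes that each clustering is given as a list of community labels, one per node, and that NMI is taken to be 0 when either clustering has zero entropy (a convention chosen here to avoid division by zero). The function names are illustrative only.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(xs, ys):
    """Mutual information between two labelings of the same nodes."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

def nmi(xs, ys):
    """NMI = I(X;Y) / sqrt(H(X) H(Y)); 0 if either entropy is 0 (assumed convention)."""
    hx, hy = entropy(xs), entropy(ys)
    if hx == 0.0 or hy == 0.0:
        return 0.0
    return mutual_information(xs, ys) / math.sqrt(hx * hy)

# Identical partitions up to relabeling score 1; independent ones score 0.
same = nmi([0, 0, 1, 1], [1, 1, 0, 0])
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

Because NMI depends only on how nodes are grouped, renaming the communities does not change the score.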

𝐾-Shell Decomposition Theory.
k-shell decomposition is a well-established method for analyzing the structure of large-scale networks [16][17][18]. In particular, it provides a method for identifying hierarchies in a network. It is assumed that the importance of a node is related not to its degree but to its location. The process assigns an integer index, $k_s$, to each node, representing its location within the successive layers (shells) of the network. The $k_s$ index is a robust measure, and the node ranking is not significantly influenced by incomplete information. The $k$-core of a network $G$ is the maximal subnetwork of $G$ in which every node has degree no less than $k$. The $k$-shell of $G$ is the set of all nodes belonging to the $k$-core of $G$ but not to the $(k+1)$-core.
Nodes are assigned to k-shells based on their remaining degree, which is obtained by successively pruning nodes whose degree is no larger than the index of the current layer. The decomposition process starts by removing all nodes with degree 1. After that, some of the remaining nodes may be left with only one link. We therefore prune the system iteratively until no node with degree 1 is left in the network. The removed nodes, along with the corresponding links, form the k-shell with index $k_s = 1$. In a similar fashion, we iteratively remove the next k-shell, with $k_s = 2$, and continue to remove higher k-shells until all nodes are removed. As a result, each node is associated with one $k_s$ index, and the network can be viewed as the collection of all k-shells. The $k_s$ value of a node can be very different from its degree. In Figure 2, we can see that $v_9$ has 7 neighbors, yet $k_s(v_9) = 1$. Figure 5 shows the network of Figure 2 after the peripheral nodes have been processed.
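The pruning procedure can be sketched in a few lines of Python. This is a minimal illustration, assuming the network is given as a dictionary that maps each node to the set of its neighbors; the function name and the toy graph are hypothetical and do not reproduce Figure 2.

```python
def k_shell_decomposition(adj):
    """Return {node: k-shell index} for an undirected graph.

    adj maps every node to the set of its neighbors.
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    shell = {}
    k = 0
    while remaining:
        # Nodes whose remaining degree is at most k belong to the k-shell.
        peel = [v for v in remaining if degree[v] <= k]
        if not peel:
            k += 1  # nothing left to prune at this level, go one layer deeper
            continue
        for v in peel:
            shell[v] = k
            remaining.discard(v)
            for u in adj[v]:
                if u in remaining:
                    degree[u] -= 1  # pruning v lowers its neighbors' degree
    return shell

# Toy graph: a hub with 7 leaves, attached to a triangle x-y-z.
adj = {"hub": {"x"} | {f"leaf{i}" for i in range(7)},
       "x": {"hub", "y", "z"}, "y": {"x", "z"}, "z": {"x", "y"}}
for i in range(7):
    adj[f"leaf{i}"] = {"hub"}

shells = k_shell_decomposition(adj)
# The hub has degree 8 but ends up in shell 1; the triangle forms shell 2.
```

The toy graph shows the same effect as the node with 7 neighbors described above: a high-degree node whose neighbors are mostly leaves is still peripheral.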

Multilevel 𝑘-Way Partitioning Method.
Partitioning the node set $V$ of a network $G$ into $k$ disjoint subsets $\{V_1, V_2, \ldots, V_k\}$ is called a $k$-way partitioning of $G$. Each subset and the edges within the subset constitute a partition of $G$. Figure 1 shows a simple network with 5 communities, surrounded by the dotted circles, and 3 partitions.
Definition 1 (effective edge lost ratio). An edge whose endpoints are in the same community, that is, an intracommunity edge, is called an effective edge. If the endpoints of an effective edge are divided into different partitions, then we call it an effective edge lost. The effective edge lost ratio is the number of effective edges lost divided by the total number of edges in the network, expressed as a percentage.
In Figure 1, $e_{3,4}$ is an effective edge lost, whereas $e_{1,2}$ is not. It is apparent that effective edges lost matter more for community detection than edges that connect nodes in different communities and are cut off by partition boundaries.
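Definition 1 can be made concrete with a short Python sketch. It assumes that the community and the partition of every node are known and given as dictionaries; the function name and the toy network (two triangles joined by one edge) are hypothetical and are not the network of Figure 1.

```python
def edge_loss_ratios(edges, community, partition):
    """Return (total edge loss ratio, effective edge lost ratio).

    Both ratios are relative to the total number of edges in the network.
    """
    cut = [(u, v) for u, v in edges if partition[u] != partition[v]]
    effective_lost = [(u, v) for u, v in cut if community[u] == community[v]]
    return len(cut) / len(edges), len(effective_lost) / len(edges)

# Two triangle communities A = {1, 2, 3} and B = {4, 5, 6}, joined by edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
community = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}

# A cut that follows the community boundary loses no effective edge.
good = edge_loss_ratios(edges, community, {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1})
# A cut through community A loses two effective edges.
bad = edge_loss_ratios(edges, community, {1: 0, 2: 0, 3: 1, 4: 1, 5: 1, 6: 1})
```

In this example the first partitioning cuts one edge, none of it effective, while the second cuts two edges, both effective.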
A number of high-quality and computationally efficient graph partitioning methods have been proposed, and multilevel graph partitioning algorithms [9,19,20] are currently considered state of the art and are used extensively. Here, we optimize the multilevel k-way partitioning method proposed by Abou-Rjeili and Karypis [21] to partition power law networks.
From Figure 3, we can see that the multilevel k-way partitioning method consists of a coarsening phase, an initial partitioning phase, and an uncoarsening and refinement phase. Instead of trying to partition the original graph $G_0$ directly, multilevel partitioning algorithms first obtain a sequence of successive approximations of the original graph, such as $G_1$, $G_2$, and $G_3$ in the coarsening phase. Each of these approximations is smaller than the graph before it. This process continues until a level of approximation is reached where the graph contains only a few hundred nodes. At this point, partitioning algorithms compute partitions of that graph, corresponding to the five small partitions of $G_3$ in the initial partitioning phase. Since the graph is quite small, even simple algorithms, such as K-L [22], can handle it and give reasonably good results. A parameter is used to control the balance of the partitions. The final step, the uncoarsening and refinement phase, maps the partitions of the smallest graph $G_3$ back onto the original graph $G_0$ to derive the final partitions.

Infomap.
In this paper, we continue our work on the information-theoretic community detection model Infomap. First, we briefly review the model. It exploits the duality between compressing a data set and detecting and extracting significant patterns or structures within the data, a statistical concept known as minimum description length statistics [23]. A random walk, represented as a Markov process, is used as a proxy for information flow. In a community-structured network, when a random walker enters a community, it tends to stay there for a long time, and the probability of moving out into another community is low.
In an undirected network, the random walk has a state $X(t) = v_i \in V$ at time $t$, indicating where it is. In the next step, $t + 1$, the walker moves to a node $v_j$ chosen at random from the neighbors of $v_i$. To describe the state of the random walker, a 2-level description model with Huffman coding is proposed. The first level encodes the communities and the second level encodes the nodes within the communities. We can then use "community ID + node ID" to uniquely describe a particular node in the network. Huffman codes are a prefix-free coding scheme and are optimally efficient for symbol-by-symbol encoding. They save space by assigning short codewords to common events or objects and long codewords to rare ones, just as Morse code uses short codes for common letters and longer codes for rare ones. The path of the random walker can thus be described as a coding sequence.
Figure 4 illustrates the 2-level description method. Assuming there are 2 communities divided as in Figure 4(b), the code sequence for the random walk in Figure 4(a) is "0 111 00 10 111 010 10 011 110 00 10 110 1011 1 00 01 10 00 11 10 01 1." The leading codeword "0" means that the random walk starts from C1. The codewords "1011 1" mean that the random walk leaves C1 and enters C2. The description length of the sequence is 50 bits, or about 2.63 bits per step. With the division in Figure 4(c), however, we need 57 bits, or 3.0 bits per step. The community division in Figure 4(b) is obviously more reasonable than that in Figure 4(c), and its average description length is shorter. From the perspective of information theory, smaller entropy corresponds to smaller uncertainty. For community detection, smaller entropy means a clearer community structure.
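To illustrate how Huffman coding gives short codewords to frequently visited symbols, here is a minimal Python sketch. The visit frequencies are made up for illustration and are not those of Figure 4; `huffman_code` is a hypothetical helper, not part of Infomap.

```python
import heapq
from itertools import count

def huffman_code(freq):
    """Build a prefix-free Huffman code for {symbol: positive weight}."""
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    tie = count()  # tie-breaker so the heap never compares the dictionaries
    heap = [(w, next(tie), {s: ""}) for s, w in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)  # two least frequent subtrees
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + bits for s, bits in c1.items()}
        merged.update({s: "1" + bits for s, bits in c2.items()})
        heapq.heappush(heap, (w1 + w2, next(tie), merged))
    return heap[0][2]

# Hypothetical visit frequencies of four nodes in one community.
freq = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = huffman_code(freq)
# Codeword lengths: a=1, b=2, c=3, d=3.
average_bits = sum(freq[s] * len(code[s]) for s in freq)
```

For these dyadic frequencies the average codeword length is 1.75 bits, which equals the entropy of the distribution, so no prefix-free code can do better.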

Problem Statement.
Assume there is an optimal community division, $M^*$, in a community-structured network $G$. With $M^*$, the network $G$ is divided into $\mathrm{num}^*$ communities, and the lower limit of the average description code length is $L(M^*)$. According to the Shannon source coding theorem [24] and Kraft's inequality [25], for any division pattern $M$, the average codeword length per source symbol, $L(M)$, for an optimal prefix-free code satisfies

$$L(M) \ge L(M^*).$$

Obviously, calculating an endless random walk on a network to get $L(M)$ is not realistic. Fortunately, an endless random walk on a network yields a steady visit frequency for each node, which can be calculated easily with many methods, such as PageRank. With the steady visit frequency distribution, we can describe the state of the random walker easily. For $X(t+1) \in \mathrm{neighbor}(X(t))$, the probability of $X(t+1)$ and $X(t)$ being in the same community is $p^{i}_{\mathrm{within}}$ and the probability of being in different communities is $q_{i,\mathrm{out}}$, where $X(t+1)$ belongs to community $i$. Then $L(M)$ can be described as

$$L(M) = q_{\mathrm{out}} H(Q) + \sum_{i=1}^{\mathrm{num}} p^{i}_{\mathrm{within}} H(P_i),$$

where $q_{\mathrm{out}}$ means the probability of moving out from the current community and $q_{\mathrm{out}} = \sum_{i=1}^{\mathrm{num}} q_{i,\mathrm{out}}$. $H(Q)$ is the average description length of nodes in all communities, and it can be expressed as

$$H(Q) = -\sum_{i=1}^{\mathrm{num}} \frac{q_{i,\mathrm{out}}}{q_{\mathrm{out}}} \log \frac{q_{i,\mathrm{out}}}{q_{\mathrm{out}}}.$$

With the probability $p_\alpha$ or $p(\alpha)$ to visit the node $v_\alpha$, $p^{i}_{\mathrm{within}}$ represents the probability of staying in the current community during the next step, and

$$p^{i}_{\mathrm{within}} = q_{i,\mathrm{out}} + \sum_{v_\alpha \in C_i} p_\alpha.$$

$H(P_i)$ expresses the information entropy of the visiting probability of the nodes in the community $C_i$, which can be written as

$$H(P_i) = -\frac{q_{i,\mathrm{out}}}{p^{i}_{\mathrm{within}}} \log \frac{q_{i,\mathrm{out}}}{p^{i}_{\mathrm{within}}} - \sum_{v_\alpha \in C_i} \frac{p_\alpha}{p^{i}_{\mathrm{within}}} \log \frac{p_\alpha}{p^{i}_{\mathrm{within}}}.$$

Because the problem is NP-complete, we cannot obtain the global optimal division pattern $M^*$ by direct computation on a big social network. However, we can achieve a set of local optima that approximates $M^*$ by partitioning the network into small subnetworks (partitions) and tackling them independently with MapReduce. The issue then becomes how to discover the optimal division pattern $M_i^*$ in partition $P_i$ and get the final $M^* = \bigcup_{i=1}^{k} \{M_i^*\}$. For $P_i$, it is sufficient to calculate $L(M_i)$ for different $M_i$ and pick the one with the shortest description length as $M_i^*$. Finally, we get a community set $C = \{C_1, C_2, \ldots, C_k\}$, where $C_i = \{c_{i,1}, c_{i,2}, \ldots, c_{i,l}\}$ corresponds to $P_i$, $\bigcup_{i=1}^{k} C_i = C$ and $C_i \cap C_j = \emptyset$, where $0 < i \neq j \le k$ and $|P| \ll |C|$.

Algorithm 1: Steady visiting probability vector calculation on MapReduce (VPC).
method Map(nid n, node V)
    p ← V.PageRank / |V.AdjacencyList|
    emit(nid n, V) //pass along network structure
    for all nodeid m ∈ V.AdjacencyList do
        emit(nid m, p) //pass pagerank contribution to neighbors
    endfor
method Reduce(nid m, [p_1, p_2, ...])
    V ← ∅
    s ← 0
    for all p ∈ [p_1, p_2, ...] do
        if IsNode(p) then
            V ← p //recover local network structure
        else
            s ← s + p //sums pagerank contributions
        endif
    endfor
    V.PageRank ← s
    emit(nid m, node V)
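The description length of a given division can be computed directly. The sketch below assumes the standard two-level map equation of Infomap [12] on an undirected, unweighted network, where the steady visit probability of a node is its degree divided by $2m$ and the exit probability of a community is the number of edges leaving it divided by $2m$. The function name and the toy network are hypothetical.

```python
import math
from collections import defaultdict

def entropy_bits(weights):
    """Entropy (in bits) of the distribution obtained by normalizing weights."""
    total = sum(weights)
    if total <= 0:
        return 0.0
    return -sum(w / total * math.log2(w / total) for w in weights if w > 0)

def description_length(edges, community):
    """Two-level map equation L(M) for an undirected, unweighted graph."""
    two_m = 2 * len(edges)
    degree, cut = defaultdict(int), defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community[u] != community[v]:
            cut[community[u]] += 1  # edge leaves u's community
            cut[community[v]] += 1  # and v's community
    members = defaultdict(list)
    for v in degree:
        members[community[v]].append(v)

    q = {c: cut[c] / two_m for c in members}  # exit probability per community
    length = sum(q.values()) * entropy_bits(list(q.values()))
    for c, nodes in members.items():
        p_nodes = [degree[v] / two_m for v in nodes]  # steady visit probabilities
        p_within = q[c] + sum(p_nodes)
        length += p_within * entropy_bits([q[c]] + p_nodes)
    return length

# Two triangles joined by one edge.
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
one = description_length(edges, {v: 0 for v in range(1, 7)})
two = description_length(edges, {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1})
```

On this toy network the two-community division needs about 2.32 bits per step, against about 2.56 bits for a single community, so the shorter description identifies the more natural division.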

Procedure of the Parallel Community Detection.
For convenience of illustration, we use Figure 1 to start this section and assume that the number of partitions is far less than the number of communities in big social networks. There are 3 stages in the parallel community detection process.
In the first stage, we calculate the steady visiting probability of all nodes (shown in Algorithm 1). Here, we modify the traditional PageRank, which was designed for directed networks, and run an iterative MapReduce-based version to get the global steady visit probability vector. In each iteration, the visit probability of $v_i$ is

$$p(v_i) = \sum_{v_j \in \mathrm{neighbor}(v_i)} \frac{p(v_j)}{d(v_j)}$$

(since undirected networks have no link sinks and need no teleport, the random jump factor is set to 0).
Second, we use the multilevel k-way partitioning method enhanced by k-shell decomposition to divide a big social network into $k$ approximately equal-sized disjoint partitions ($P_1$, $P_2$, and $P_3$ in Figure 1). Edges cut off by partition boundaries are discarded. As the networks studied here are sparse and have community structure, the edge cut (lost) ratio will be low. However, the partitioning method has a decisive influence on the final community detection effectiveness, which will be explained with experiments in Section 4. A matching of a network is a set of edges no two of which share a node. To coarsen a network, a commonly used method is to collapse the node pairs that form a matching, obtained by, for example, random matching, heavy-edge matching, or maximum weighted matching. However, such coarsening shrinks the network slowly and does not consider the relative importance of nodes in different locations. Power law networks contain a large number of low-degree nodes with low $k_s$ values, and we can exploit this characteristic. Here, we use k-shell decomposition to merge the peripheral nodes quickly and more accurately during the coarsening phase (shown in Algorithm 2).
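The first stage can be illustrated with a single-process Python simulation of the map, shuffle, and reduce steps. This is only a sketch of the idea, not Hadoop code: the function names are hypothetical, the random jump factor is fixed at 0, and every node is assumed to have at least one neighbor.

```python
from collections import defaultdict

def vpc_map(nid, node):
    """Emit the node structure and an equal rank share to every neighbor."""
    yield nid, ("node", node["adj"])  # pass along network structure
    share = node["rank"] / len(node["adj"])
    for m in node["adj"]:
        yield m, ("rank", share)  # pass rank contribution to neighbors

def vpc_reduce(nid, values):
    """Recover the adjacency list and sum the incoming rank contributions."""
    adj, total = [], 0.0
    for kind, payload in values:
        if kind == "node":
            adj = payload
        else:
            total += payload
    return nid, {"adj": adj, "rank": total}

def vpc_iterate(graph, iterations):
    for _ in range(iterations):
        shuffled = defaultdict(list)  # stands in for the shuffle phase
        for nid, node in graph.items():
            for key, value in vpc_map(nid, node):
                shuffled[key].append(value)
        graph = dict(vpc_reduce(nid, vals) for nid, vals in shuffled.items())
    return graph

def build_graph(edges):
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return {v: {"adj": nbrs, "rank": 1.0 / len(adj)} for v, nbrs in adj.items()}

# Triangle a-b-c with a pendant node d attached to c.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
result = vpc_iterate(build_graph(edges), 100)
```

On a connected, non-bipartite undirected network, this iteration converges to the degree-proportional distribution $d(v_i)/2m$; in the toy graph, node c approaches 3/8 and node d approaches 1/8.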

Experiments and Analysis
In this section, we conduct several experiments and analyze the results. All experiments were run on the Hadoop-1.1.1 cluster of Antivision Software Ltd. The cluster consists of 20 PowerEdge R320 servers (Intel Xeon CPU E5-1410 @2.80 GHz, 8 GB memory) with the 64-bit NeoKylin Linux OS, and the servers are connected by a Cisco 3750G-48TS-S switch. The data sets are shown in Table 1, including artificial networks and real networks.
All artificial networks used here are generated by the LFR benchmark. In LFR, some parameters give us direct control over network properties: network size ($n$), degree distribution (exponent $\gamma$, maximum degree, and average degree), and community structure (exponent $\beta$ and mix) [26]. $\gamma$ and $\beta$ are the exponents of the degree and community size distributions, which range over [2,3] and [1,2], respectively. Mix is the ratio of edges connecting nodes from different communities divided by collective edges of all communities. To take balanced, mid-range values, we set $\gamma = 2.5$ and $\beta = 1.5$ for the artificial networks.

Accuracy Experiments.
In the accuracy experiments, we compare our method, Pinfomr, with two top-class methods, the Louvain algorithm [27] and the OSLOM algorithm [28], on different data sets and with different partition numbers. The data sets used are D0, D1, and D2, and the results are shown in Figure 7, where $|P|$ means the partition number. The situation when $|C| = 1$ or $|C| = n$ is defined by us as no community structure, and NMI in this case is set to 0, but the case when $|C|$ is close to $n$ is discarded. Taking Louvain in Figure 7(a) as an example, when mix = 0.75, $|C|/n = 0.373$, and we conclude that a community contains only about 3 nodes on average. From the design of LFR, we know that when the mix value is too high, for example higher than 0.75, there will be no obvious community structure, and the network will no longer be a power law network but more like a random network, which is not the focus here. Figure 7 indicates that Pinfomr is more stable and accurate than the others in uncovering community structures in power law networks. As for running time, on the same data set Louvain consumes the longest time and Pinfomr needs the shortest time. OSLOM requires a little more time than Pinfomr when the mix parameter is not too big.
From Figure 7(c), we know that the NMI decreases as the partition number increases, but the performance is excellent and stable when mix is less than 0.75, where NMI stays at about 1.0. Our results show that Pinfomr is able to achieve better results in a shorter period of time, although this is accompanied by some loss of performance as the partition number grows.

Partitioning Experiment.
In the previous section, we mentioned that the quality of the partitioning plays a vital role in the final performance of parallel community detection. Therefore, we conduct an experiment in this section to test the impact and effectiveness of different partitioning methods on Pinfomr.
We compare two simple partitioning methods with the improved multilevel k-way partitioning method. One is a sequential partitioning method that divides the network according to the storage order of the nodes and edges on HDFS. The other is a random matching partitioning method that randomly chooses nodes to generate a matching. For example, assume that we bisect $G = (V, E)$ with $n = 20{,}000$ and $m = 300{,}000$ into $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$. (1) With the sequential partitioning method, the first 10,000 nodes are collected to form $V_1$, and the links within $V_1$ form $E_1$. The other nodes are left for $V_2$, and the links within $V_2$ form $E_2$. (2) With the random matching method, we randomly select 5,000 node pairs into $V_1$, and all links within $V_1$ form $E_1$; the process for $G_2$ is similar.
Dividing a connected network into subnetworks or partitions causes edge loss. A good partitioning method tries to cut along the gaps between communities and avoids cutting effective edges. Here, we use data set D1 to test the performance of the different partitioning methods with $|P| = 2$ and $|P| = 4$. In Figures 8(a) and 8(b), we can see that the multilevel k-way partitioning method is stable, and the results of Pinfomr with it are very close to the results of Infomap and also very close to the real results. From Figures 8(c) and 8(d), we see that, for the multilevel k-way partitioning method, the total edge loss ratio increases linearly with the mix parameter. From the meaning of the mix parameter, it is easy to understand that the effective edge loss ratio always remains at a low level until mix rises to 0.70. The behavior of the sequential partitioning method and the random partitioning method is also easy to explain. The distribution of edges in LFR-generated networks is random and uniform, regardless of storage order. As a result, the total edge loss ratio will ...
In addition, we conduct a degree distribution test on a real network, LiveJournal, to verify the performance of the improved multilevel partitioning method. The network is divided into 4 partitions by the improved multilevel k-way partitioning method, and the degree distributions of the original network and the 4 subnetworks are shown in Figure 9. Comparison indicates that the subnetworks obtained from the improved multilevel k-way partitioning method maintain the degree distribution characteristics of the original network.
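The two baseline partitioning methods can be mimicked with a small Python simulation. This is a simplified sketch under stated assumptions: the network is a uniformly random graph rather than an LFR network, the random method shuffles nodes into equal-sized blocks instead of building a matching, and all names are hypothetical.

```python
import random

def sequential_partition(nodes, k):
    """Assign nodes to k equal-sized blocks in their given (storage) order."""
    size = -(-len(nodes) // k)  # ceiling division
    return {v: i // size for i, v in enumerate(nodes)}

def random_partition(nodes, k, seed=0):
    """Shuffle the nodes, then cut them into k equal-sized blocks."""
    shuffled = list(nodes)
    random.Random(seed).shuffle(shuffled)
    return sequential_partition(shuffled, k)

def edge_loss_ratio(edges, part):
    """Fraction of edges whose endpoints fall into different partitions."""
    return sum(part[u] != part[v] for u, v in edges) / len(edges)

# Uniformly random test network: 2,000 nodes and 20,000 edges.
rng = random.Random(1)
nodes = list(range(2000))
edges = [tuple(rng.sample(nodes, 2)) for _ in range(20000)]

losses = {
    k: (edge_loss_ratio(edges, sequential_partition(nodes, k)),
        edge_loss_ratio(edges, random_partition(nodes, k)))
    for k in (2, 4)
}
```

When edge endpoints are independent of the node order, either method is expected to cut a fraction of roughly $1 - 1/|P|$ of all edges, that is, about one half for $|P| = 2$ and about three quarters for $|P| = 4$, because neither method takes the community structure into account.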

Scalability and Performance Experiment.
Our study aims to uncover community structures in big social networks and to improve resource utilization as much as possible. Here, we address the two problems together by means of MapReduce, and we achieve the goal at the expense of a small loss of performance. In this section, we test the scalability and performance of the parallel community detection method. The data sets used are D4, D5, LiveJournal (http://snap.stanford.edu/data/com-LiveJournal.html), Youtube (http://snap.stanford.edu/data/as-skitter.html), and Orkut (http://snap.stanford.edu/data/com-Orkut.html).
For a given network in Figures 10(a) and 10(b), when $|P|$ increases, the speedup ratio increases, but ever more slowly, since MapReduce needs some time to initialize before the map tasks start to run and to transmit data from the map phase to the reduce phase. For a given $|P|$, the speedup ratio becomes higher as the network size grows. For Figure 10(c), we only present the running time of the parallel community detection method on MapReduce, because a single server, or $|P| = 1$, cannot handle such large networks. Finally, we apply the same process to the real networks. The experiments on real networks shown in Figure 11 also confirm that our parallel community detection method has excellent scalability. From the results, we can conclude that, when $|P|$ increases, the subnetwork assigned to each map task becomes smaller, and the total edge lost ratio increases, which further reduces the subnetwork size.
From Figures 10 and 11, we observe the following. For a constant data size, the running time is approximately linear in $|P|$ when $|P|$ is small, so the running time is significantly affected by the number of partitions. When the partition number $|P|$ reaches a "critical point" (Figure 11(c), Orkut, $|P| = 72$, and Figure 11(b), Youtube, $|P| = 20$), the running time is less affected by changes in the partition number and shows a "long tail effect" to some extent. The reason is that the overhead of MapReduce is basically fixed. For a larger social network with the same number of map tasks, MapReduce's initialization time accounts for a smaller proportion of the total running time. When the partition number increases and the total running time decreases, the proportion of the initialization time is no longer negligible. This is why our method exhibits the "long tail effect" on different data sets.
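The saturation of the running time can be reproduced with a toy cost model. The sketch below assumes that the running time is a fixed start-up overhead plus perfectly divisible work; both the model and the numbers are hypothetical and are not fitted to the measurements in Figures 10 and 11.

```python
def running_time(work, partitions, overhead):
    """Toy model: fixed MapReduce start-up cost plus evenly divided work."""
    return overhead + work / partitions

def speedup(work, partitions, overhead):
    """Speedup relative to running the same job with a single partition."""
    return running_time(work, 1, overhead) / running_time(work, partitions, overhead)

# Hypothetical units: 1,000 units of work and 50 units of fixed overhead.
small = [speedup(1000, p, 50) for p in (2, 4, 8, 16)]
# A ten times larger network with the same overhead.
large = [speedup(10000, p, 50) for p in (2, 4, 8, 16)]
```

In this model the speedup is bounded by 1 + work/overhead, so doubling the partition number helps less and less, and a larger network reaches a higher speedup at the same partition number. This matches the qualitative behavior described above.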

Conclusion
Community detection has become an important research topic in social networks. Traditional community mining algorithms cannot effectively adapt to current big social network scenarios [29,30]. Infomap is an excellent standalone community detection method and, by means of the multilevel k-way partitioning method enhanced by k-shell decomposition, we are able to develop a parallel community discovery method on the MapReduce framework. The experiments verify the validity of the proposed work, which may serve as a reference for social network analysis and social community mining with big data techniques. Next, we will try to use overlapping partitioning methods to further improve the community detection accuracy.

Figure 4: Random walk and 2-level Huffman coding on a network with two communities.

Figure 5: The result of Figure 2 without peripheral nodes.Integers indicate node weights.

Figure 6: A schematic diagram of MapReduce process for community detection.

Figure 7: Accuracy and running time tests on different data sets.

Figure 11: Scalability tests on real networks.

The map procedure of community detection on a subnetwork $P_i$.

Table 1: Data sets used in experiments (increment of mix is set to 0.05; $M = 10^6$).