Parallelizing SLPA for Scalable Overlapping Community Detection

Communities in networks are groups of nodes whose connections to other nodes in the same community are stronger than their connections to nodes in the rest of the network. Quite often nodes participate in multiple communities; that is, communities can overlap. In this paper, we first analyze what other researchers have done to utilize high performance computing to perform efficient community detection in social, biological, and other networks. We note that detection of overlapping communities is more computationally intensive than disjoint community detection, and the former presents new challenges that algorithm designers have to face. Moreover, the running time of many existing algorithms grows superlinearly with the network size, making them unsuitable for processing large datasets. We use the Speaker-Listener Label Propagation Algorithm (SLPA) as the basis for our parallel overlapping community detection implementation. SLPA provides near linear time overlapping community detection and is well suited for parallelization. We explore the benefits of a multithreaded programming paradigm and show that it yields a significant performance gain over sequential execution while preserving the high quality of community detection. The algorithm was tested on four real-world datasets with up to 5.5 million nodes and 170 million edges. In order to assess the quality of community detection, at least four different metrics were used for each of the datasets.


INTRODUCTION
Analysis of social, biological, and other networks is a field which attracts significant attention as more and more algorithms and real-world datasets become available. In social science, a community is loosely defined as a group of individuals who share certain common characteristics [1]. Based on similarity of certain properties, social agents can be assigned to different social groups or communities. Communities allow researchers to analyze social behaviors and relations between people from different perspectives. As social agents can exhibit traits specific to different groups and play an important role in multiple groups, communities can overlap. Usually, there is no a priori knowledge of the number of communities and their sizes. Quite often, there is no ground truth either. Knowing the community structure of a network empowers many important applications. Communities can be used to model, predict, and control information dissemination. Marketing companies, advertisers, sociologists, and political activists are able to target specific interest groups. The ability to identify key members of a community provides a potential opportunity to influence the opinion of the majority of individuals in the community. Ultimately, the entire community structure can be altered or destroyed by acting upon only a small fraction of the most influential nodes.
Biological networks such as neural, metabolic, protein, genetic, or pollination networks and food webs model interactions between components of a system that represent some biological processes [2]. Nodes in such networks often correspond to genes, proteins, individuals, or species. Common examples of interactions are infectious contacts, regulatory interaction, or gene flow.
The majority of community detection algorithms operate on networks which might have strong data dependencies between the nodes. While there are clearly challenges in designing an efficient parallel algorithm, the major factor which limits the performance is scalability. Most frequently, a researcher needs to have community detection performed for a dataset of interest as fast as possible subject to the limitations of available hardware platforms. In other words, for any given instance of a community detection problem, the total size of the problem is fixed while the number of processors varies to minimize the solution time. This setting is an example of strong scaling. Since the problem size per processor varies with the number of processors, the amount of work per processor goes down as the number of processors is increased. At the same time, the communication and synchronization overhead does not necessarily decrease and can actually increase with the number of processors, thus limiting the scalability of the entire solution.
Yet, there is another facet of scaling community detection solutions. As more and more hardware compute power becomes available, it seems quite natural to try to uncover the community structure of increasingly larger datasets. Since more compute power currently tends to come in the form of increased processor count rather than in a single high performance processor (or a small number of such processors), it is crucial to provide enough data for each single processor to perform efficiently. In other words, the amount of work per processor should be large enough, so that communication and synchronization overhead is small relative to the amount of computation. Moreover, a well-designed parallel solution should demonstrate performance which at least does not degrade and hopefully even improves when run on larger and larger datasets.
Accessing data that is shared between several processes in a parallel community detection algorithm can easily become a bottleneck. Several techniques have been studied, including shared-nothing, master-slave, and data replication approaches, each having its merits and drawbacks. Shared memory architectures make it possible to build solutions that require no data replication at all since any data can be accessed by any processor. One of the key design features of our multithreaded approach is to minimize the amount of synchronization and achieve a high degree of concurrency of code running on different processors and cores. Provided the data is properly partitioned, the parallel algorithm that we propose does not suffer performance penalties when presented with increasing amounts of data. Quite the contrary, results show that with larger datasets, the values of speedup avoid saturation and continue to improve up to maximal processor counts.
Validating the results of community detection algorithms presents yet another challenging task. After running a community detection algorithm, how do we know if the resulting community structure makes any sense? If a network is known to have some ground truth communities then the problem is conceptually clear: we need to compare the output of the algorithm with the ground truth. It might sound like an easy problem to solve but in reality there are many possible ways to compare different community structures of the same network. Unfortunately, there is no single method that can be used in every situation. Rather, a combination of metrics can tell us how far our solution is from that represented by the ground truth. As mentioned earlier, for many real-life datasets it is not feasible to come up with any kind of ground truth communities at all. In this case a comparative study of values obtained from different metrics for community structures output by different algorithms seems to be the only way of judging the quality of community detection.
The rest of the paper is organized as follows. An overview of relevant research on parallel community detection is presented in Section 2. Section 3 provides an overview of the sequential SLPA algorithm upon which we base our parallel implementation. It also discusses details of the multithreaded community detection on a shared-memory multiprocessor machine along with busy-waiting techniques and implicit synchronization used to ensure correct execution. We describe the way we partition the data and rearrange nodes within a partition to maximize performance. We also discuss the speedup and efficiency accomplished by our approach. Detailed analysis of the quality of community structures detected by our algorithm for four real-life datasets relative to ground truth communities (where available) and based on the sequential SLPA implementation is given in Section 4. Finally, in Section 5, some closing remarks and conclusions are provided. We also discuss some of the limitations of the presented solution and briefly describe future work.

RELATED WORK
Substantial effort has been put during the last decade into studying network clustering and analysis of social and other networks. Different approaches have been considered and a number of algorithms for community detection have been proposed. As online social communities and the networks associated with them continue to grow, parallel approaches to community detection are regarded as a way to increase the efficiency of community detection and therefore receive a lot of attention.
The clique percolation technique [3] considers cliques in a graph and performs community detection by finding adjacent cliques. The k-means clustering algorithm partitions m n-dimensional real vectors into k n-dimensional clusters where every point is assigned to a cluster such that the objective function is minimized [4]. The objective function is the within-cluster sum of squares of distances between each point and the cluster center. There are several ways to calculate initial cluster centers. A quick and simple way to initialize cluster centers is to take the first k points as the initial centers. Subsequently, at every pass of the algorithm the cluster centers are updated to be the means of the points assigned to them. The algorithm does not aim to minimize the objective function over all possible partitions but produces a locally optimal solution instead, i.e. a solution in which for any cluster the within-cluster sum of squares of distances between each point and the cluster center cannot be improved by moving a single point from one cluster to another. Another approach described in [5] utilizes an iterative scan technique in which the value of a density function is gradually improved by adding or removing edges. The algorithm implements a shared-nothing architectural approach. The approach distributes data on all the computers in a setup and uses a master-slave architecture for clustering. In such an approach, the master may easily become a bottleneck as the number of processors and the network size increase. A parallel clustering algorithm is suggested in [6], which is a parallelized version of DBSCAN [7].
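The k-means procedure outlined above can be illustrated with a minimal sketch. It assumes a plain batch (Lloyd-style) iteration with the first k points as initial centers; the function names `kmeans` and `within_cluster_ss` and the `max_passes` parameter are illustrative choices and are not taken from [4].

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, k, max_passes=100):
    """Batch k-means sketch: returns (centers, assignment)."""
    # Quick initialization: take the first k points as the initial centers.
    centers = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(max_passes):
        # Assign every point to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster: a local optimum of this scheme
        assignment = new_assignment
        # Update each center to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment


def within_cluster_ss(points, centers, assignment):
    """Objective function: within-cluster sum of squared distances."""
    return sum(sq_dist(p, centers[a]) for p, a in zip(points, assignment))


points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0)]
centers, assignment = kmeans(points, k=2)
```

The result depends on the initial centers, which is why the solution is only locally optimal with respect to the objective function.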
A community detection approach based on propinquity dynamics is described in [8]. It doesn't use any explicit objective function but rather performs community detection based on heuristics. It relies on calculating the values of topology-based propinquity which is defined as a measure of probability that two nodes belong to the same community. The algorithm works by consecutively increasing the network contrast in each iteration by adding and removing edges in such a way as to make the community structure more apparent. Specifically, an edge is added to the network if it is not already present and the propinquity value of the endpoints of this proposed edge is above a certain threshold, called emerging threshold. Similarly, if the propinquity value of the endpoints of an existing edge is below a certain value, called cutting threshold, then this edge is removed from the network. Since inserting and removing edges alters the network topology, it affects not only propinquity between individual nodes but also the overall propinquity of the entire topology. The propinquity of the new topology can then be calculated and used to guide the subsequent changes to the topology in the next iteration. Thus, the whole process called propinquity dynamics continues until the difference between topologies obtained in successive iterations becomes small relative to the whole network.
Since both topology and propinquity experience only relatively small changes from iteration to iteration, it is possible to perform the propinquity dynamics incrementally rather than recalculating all propinquity values in each iteration. Optimizations of performing incremental propinquity updates achieve a running time complexity of O((|V|+|E|)·|E|/|V|) for general networks, and O(|V|) for sparse networks.
It is also shown in [8] that community detection with propinquity dynamics can efficiently take advantage of parallel computation using message passing. Nodes are distributed among the processors which process them in parallel. Since it is essential that all nodes are in sync with each other, the Bulk Synchronous Parallel (BSP) model is used to implement the parallel framework. In this model, the computation is organized as a series of supersteps. Each superstep consists of three major actions: receiving messages sent by other processors during the previous superstep, performing computation, and sending messages to other processors. Synchronization in BSP is explicit and takes the form of a barrier which gathers all the processors at the end of the superstep before continuing with the next superstep. Two types of messages are defined for the processors to communicate with each other. The first type is used to update propinquity maps that each processor stores locally for its nodes. Messages of the second type contain parts of the neighbor sets that a processor needs in its local computation.
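The superstep structure of BSP can be illustrated with a small sequential simulation. This is only a sketch of the receive, compute, send, barrier cycle and not the framework of [8]; the names `bsp_run` and `ring_max`, and the toy task of propagating a maximum around a ring of processors, are assumptions made for illustration.

```python
def bsp_run(num_procs, states, compute, supersteps):
    """Simulate BSP supersteps sequentially; returns the final per-processor states."""
    states = list(states)
    inboxes = [[] for _ in range(num_procs)]
    for _ in range(supersteps):
        outboxes = [[] for _ in range(num_procs)]
        for pid in range(num_procs):
            # 1. Receive the messages sent during the previous superstep.
            received = inboxes[pid]
            # 2. Perform the local computation.
            states[pid], outgoing = compute(pid, states[pid], received)
            # 3. Send messages; they become visible only in the next superstep.
            for dest, message in outgoing:
                outboxes[dest].append(message)
        # Barrier: every processor has finished before messages are delivered.
        inboxes = outboxes
    return states


def ring_max(num_procs):
    """Toy computation: keep the largest value seen and pass it to the next processor."""
    def compute(pid, state, received):
        new_state = max([state] + received)
        return new_state, [((pid + 1) % num_procs, new_state)]
    return compute


final_states = bsp_run(4, [3, 9, 1, 4], ring_max(4), supersteps=4)
```

Because messages are delivered only at the barrier, the outcome does not depend on the order in which processors are simulated within a superstep.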
A number of researchers have explored the popular MapReduce parallel programming model to perform network mining operations. For example, the PeGaSus library (Peta-Scale Graph Mining System) described in [9] is built upon the Hadoop platform to perform several graph mining tasks such as PageRank calculations, spectral clustering, diameter estimation, and determining connected components. The core of PeGaSus is the GIM-V function (Generalized Iterated Matrix-Vector multiplication). GIM-V is capable of performing three operations: combining two values, combining all the values in the set, and replacing the old value with a new one. Because GIM-V is defined in such general terms, it is also quite universal. All other functions in the library are implemented as function calls to GIM-V with proper custom definitions of the three GIM-V operations. Fast algorithms for GIM-V utilize a number of optimizations like using data compression, dividing elements into blocks of fixed size, and clustering the edges. Finding connected components with GIM-V is essentially equivalent to community detection. The number of iterations required to find connected components is at most the diameter of the network. One iteration of GIM-V has the time complexity of O(((|V|+|E|)/P) · log((|V|+|E|)/P)), where P is the number of processors in the cluster. Running PeGaSus on an M45 Hadoop supercomputer cluster shows that GIM-V scales up linearly as the number of machines is increased from 3 to 90. Accordingly, PeGaSus is able to reduce the execution time on real-world networks containing up to hundreds of billions of edges from many hours to a few minutes.
The HEigen algorithm introduced in [10] is an eigensolver for large scale networks containing billions of edges. It is built upon the same MapReduce parallel programming model as PeGaSus, and is capable of computing k eigenvalues for sparse symmetric matrices. Similarly to PeGaSus, HEigen scales up almost linearly with the number of edges and processors and performs well on networks with up to billions of edges. Its asymptotic running time is the same as that of PeGaSus' GIM-V.
In [11] the authors consider a disjoint partitioning of a network into connected communities. They propose a massively parallel implementation of an agglomerative community detection algorithm that supports both maximizing modularity and minimizing conductance. The performance is evaluated on two different threaded hardware architectures: a multiprocessor multicore Intel-based server and a massively multithreaded Cray XMT2. This approach is shown to scale well on two real-world networks with up to tens of millions of nodes and several hundred million edges.
Another method for partitioning a network into disjoint communities is Scalable Community Detection (SCD) [12]. This two-phase algorithm works by optimizing the value of an objective function and is capable of processing undirected and unweighted graphs.
SCD uses a Weighted Community Clustering (WCC) metric proposed in [13] as the objective function. Instead of performing simple edge counting, WCC works with more sophisticated graph structures, such as triangles. The quality of a partition is measured based on the number of triangles in a graph. Intuitively, more connections between the nodes within a community corresponds to a larger number of triangles. Communities tend to have many highly connected nodes which are much more likely to close triangles with each other rather than with nodes from other communities. Thus, for a particular node and community the value of WCC quantifies the level of cohesion of the node to the community. The metric is defined not only for individual nodes and communities but also for a community as a whole and the entire partition. One of the advantages of WCC over modularity is that it does not have a resolution limit problem. Moreover, optimization of WCC is mathematically guaranteed to produce cohesive and structured communities. The measure of cohesion is defined for a partition as some real-valued function f called degree of cohesion. For each subset of nodes f assigns a value in the range [0, 1] such that high values of f correspond to good communities and low values of f correspond to bad communities. For a given context (social, biological, etc.) an adequate metric f captures the features specific to this context. For example, for social networks the cohesion of a community depends on the number of triangles closed by the nodes inside this community. Furthermore, triangles are also used as a good indicator of community structure.
The operation of SCD consists of two phases which are executed sequentially. The first stage comprises graph cleanup and initial partitioning. Clean up is performed by removing the edges which do not close any triangles. Then the graph is partitioned based on the values of the clustering coefficient of every node. Nodes are taken in the order of decreasing clustering coefficient and placed in communities together with their neighbors. Such partitioning yields communities with high values of WCC which is beneficial for the subsequent optimization process.
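The first phase described above can be sketched as follows. This is a simplified illustration under stated assumptions and not the implementation of [12]: the graph is a dictionary mapping each node to the set of its neighbors, ties in the clustering coefficient are broken by node identifier, and the function name `initial_partition` is hypothetical.

```python
def initial_partition(adj):
    """Clean the graph and build an initial partition.

    adj maps each node to the set of its neighbors (undirected graph).
    Returns (cleaned adjacency, dict mapping node -> community id).
    """
    # Cleanup: keep an edge (u, v) only if u and v share a neighbor,
    # i.e. only if the edge closes at least one triangle.
    cleaned = {u: {v for v in nbrs if adj[u] & adj[v]} for u, nbrs in adj.items()}

    def clustering_coefficient(u):
        nbrs = cleaned[u]
        degree = len(nbrs)
        if degree < 2:
            return 0.0
        # Each link between two neighbors of u is counted twice in the sum.
        links = sum(len(cleaned[v] & nbrs) for v in nbrs) / 2
        return links / (degree * (degree - 1) / 2)

    # Visit nodes in order of decreasing clustering coefficient
    # (tie-breaking by node id is an assumption of this sketch).
    order = sorted(cleaned, key=lambda u: (-clustering_coefficient(u), u))
    community = {}
    for u in order:
        if u in community:
            continue
        community[u] = u  # u seeds a new community
        for v in cleaned[u]:
            if v not in community:
                community[v] = u  # unassigned neighbors join the seed
    return cleaned, community


# Two triangles joined by a bridge edge 3-4 that closes no triangle.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
cleaned, community = initial_partition(graph)
```

In this toy graph the bridge is removed during cleanup and each triangle becomes one initial community.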
The second phase is responsible for the refinement of the initial partition. WCC optimization process consists of iterations which are repeated as long as the value of WCC for the entire partition keeps improving. In order to improve the value of WCC the best of the following three movements is chosen for each node. Those are the only movements which can potentially improve the WCC score:
• Keep the network unchanged.
• Remove a node from its current community and place it in its own singleton community.
• Move a node from one community to another.
After movements for all the nodes have been selected, the WCC metric is calculated for the entire partition and compared to the previous value to determine if there is an overall improvement of the score. The refinement process stops when there was no improvement (or improvement was less than a specified threshold) during the most recent iteration.
Computing the value of WCC directly at each iteration and for each node is computationally expensive and therefore should be avoided, especially for high degree nodes. In order to speed up calculations it is possible to exploit the fact that the refinement process operates using the improvement of the score and therefore computing the absolute value of WCC is not necessary.
Instead of calculating WCC directly, SCD uses certain graph statistics to build a WCC estimator. The estimator evaluates the improvement of WCC only once per iteration spending just O(1) time per node.
Assuming graphs have a quasi-linear relation between the number of nodes and the number of edges and the number of iterations of the refinement process is a constant, the overall running time complexity of SCD is O(m · log n), where n is the number of nodes and m is the number of edges in the graph.
The advantage of the SCD algorithm is its amenability to parallelization. This is due to the fact that during the optimization process improvements of WCC are considered for every node individually and independently of other nodes. Therefore, the best movement can be calculated for all nodes simultaneously using whatever parallel features the underlying computing platform has to offer. Moreover, applying the moves to all nodes is also done in parallel.
SCD is implemented in C++ as a multithreaded application which uses OpenMP API for parallelization. Concurrency during the refinement process is achieved by considering improvements of WCC and then applying movements independently for each node. Benchmark datasets used in experiments include a range of networks of different sizes: Amazon, DBLP, Youtube, LiveJournal, Orkut, and Friendster. All of these graphs contain ground truth communities which are required to evaluate the quality of communities produced by SCD.
Normalized Mutual Information (NMI) and average F1 score are used to evaluate the quality of community detection. SCD is compared against the following algorithms: Infomap, Louvain, Walktrap, BigClam, and Oslom. No distinction is made between methods which perform only disjoint community detection and those that are capable of detecting overlapping communities. The output of each algorithm is compared against ground truth communities without regard to possible overlaps. Although the values of NMI and average F1 score obtained for SCD are close to the results of other algorithms, it outperforms its competition on almost all datasets.
In terms of the runtime performance, SCD is much faster than the majority of other algorithms used in the experiment. In a single threaded mode the largest of the datasets used (Friendster) was processed in about 12 hours. SCD scales almost linearly with the number of edges in the graph. Using multiple threads can reduce the processing time even further. With 4 threads it takes a little bit over 4 hours to perform community detection on the Friendster network. Although the values of speedup are not explicitly presented, it can be inferred that the advantage of using multiple threads varies considerably depending on the dataset. The best case seems to be with the Orkut graph for which speedup grows linearly as the number of threads is increased from 1 to 4. However, since the scope of parallelization in the experiment is modestly limited to just 4 threads, it is unclear how the scalability of the multithreaded SCD behaves when more than 4 cores are utilized.
A family of label propagation community detection algorithms includes Label Propagation Algorithm (LPA) [14], Community Overlap PRopagation Algorithm (COPRA) [15], and Speaker-listener Label Propagation Algorithm (SLPA) [16]. The main idea is to assign identifiers to nodes, and then make them transmit their identifiers to their neighbors. With node identifiers treated as labels, a label propagation algorithm simulates the exchange of labels between connected nodes in the network. At each step of the algorithm each and every node that has at least one neighbor receives a label from one of its neighbors. Nodes keep a history of labels that they have ever received organized as a histogram which captures the frequency (and therefore the rank) of each label. The number of steps, or iterations, of the algorithm determines the number of labels each node accumulates during the label propagation phase. Being one of the parameters of the algorithm, the number of iterations eventually affects the accuracy of community detection. Clearly, the running time of the label propagation phase is linear with respect to the number of iterations. The algorithm is guaranteed to terminate after a prescribed number of iterations. When it does, communities data is extracted from nodes' histories.
Staudt and Meyerhenke [17] proposed PLP, PLM, and EPP algorithms for non-overlapping community detection, i.e. determining a partitioning of the node set.
Parallel Label Propagation (PLP) algorithm is a variation of the sequential LPA capable of performing detection of non-overlapping communities in undirected weighted graphs. PLP differs from the original formulation of the Label Propagation Algorithm [14] in that it avoids explicitly randomizing the node order and relies instead on asynchronism of concurrently executed PLP code threads. This way it saves the cost of explicit randomization. In order to optimize code execution even further, nodes are divided into active nodes and inactive nodes. Since labels of inactive nodes cannot be updated in the current iteration, the label propagation process is only performed on active nodes, thus reducing the amount of computation.
The termination criterion used by PLP is also different from the original description [14]. PLP uses a threshold value to stop processing. The value of the threshold is determined empirically and set to n · 10^−5, where n is the number of nodes in the graph. Therefore, for the majority of graphs which participated in the experiment, the number of iterations is relatively small (from 2 to about a hundred). Moreover, no justification is provided for this formula which establishes a relation between the number of iterations and the number of nodes in the graph. Although it is claimed that "clustering quality is not significantly degraded by simply omitting these iterations", it is also admitted that "while the PLP algorithm is extremely fast, its quality might not always be satisfactory for some applications". No results are presented to show how the number of iterations affects the quality of community detection or how the modularity scores of PLP compare to those of the competition.
A locally greedy, agglomerative (bottom-up) multilevel community detection method called Parallel Louvain Method (PLM) is based on modularity optimization. Starting from some initial partition, nodes are moved from one community to another as long as it increases the objective function, i.e. modularity. When modularity reaches a local optimum, a graph is coarsened and modularity optimization process is repeated.
Ensemble Preprocessing (EPP) algorithm is a combination of several community detection methods. Its main goal is to form a classifier which decides if a pair of nodes should belong to the same community. EPP requires a preprocessing step which is performed by several parallel PLP instances running concurrently. The consensus of several base classifiers is used to form core communities which are coarsened to reduce the problem size.
Ensemble Multilevel (EML) method is a recursive extension of the ensemble preprocessing algorithm. First, the core clustering is produced. Then the graph is contracted to a smaller graph and the same algorithm is called recursively. EML requires a termination condition to be used to stop the recursion.
All algorithms in [17] are implemented in C++. Parallel code is implemented using the OpenMP API. Nodes are distributed between the threads and processed concurrently. Performance of the algorithms is compared to several other community detection methods: CLU_TBB, RG, CGGC, CGGCi, and the original sequential Louvain implementation. In order to compare the quality of results produced by different algorithms, Staudt and Meyerhenke use modularity [18]. Although modularity is very popular, it was shown to suffer from the resolution limit and is also known to have other issues and limitations. There are other community quality metrics as well as modified versions of the original definition of modularity which overcome some of these problems [19]. However, modularity is the only measure used to compare the quality of communities produced by different algorithms in this experiment.
A shared-memory multiprocessor machine was used to test the performance and community quality of different algorithms. EML performed poorly while PLP and PLM were found to pay off with respect to either the execution time or community detection quality.
PLP was the fastest algorithm tested. It demonstrated linear strong scaling in the range 2-16 threads for uk-2002, the largest network which participated in all experiments. No data on scaling results for other datasets were provided. Since only one graph describes speedup for PLP, it is difficult to measure the values exactly but approximately they are: 0.92 for 2 threads (i.e. slower than with a single thread), 1.45 for 4 threads, 2.6 for 8 threads, and 4.6 for 16 threads. The running time drops in a slightly sublinear manner with the number of threads, although the absolute values of speedup are quite modest, and efficiency slowly goes down from 35% for 4 threads to 29% for 16 threads.
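The relation between the speedup values and the efficiency figures quoted above is simple arithmetic: parallel efficiency is the speedup divided by the number of threads. The short sketch below recomputes it from the approximate speedups read off the plot in [17]; the names `efficiency` and `approx_speedup` are illustrative.

```python
def efficiency(speedup, threads):
    """Parallel efficiency: fraction of ideal linear speedup actually achieved."""
    return speedup / threads


# Approximate PLP speedups for uk-2002 as quoted in the text (threads -> speedup).
approx_speedup = {2: 0.92, 4: 1.45, 8: 2.6, 16: 4.6}

approx_efficiency = {
    threads: efficiency(speedup, threads)
    for threads, speedup in approx_speedup.items()
}
```

With these inputs the efficiency is roughly 0.36 for 4 threads and 0.29 for 16 threads, which agrees with the percentages above to within the precision of values estimated from a plot.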
In almost all the cases EPP was able to improve the values of modularity achieved by PLM. However, this advantage comes at the cost of running on average 10 times slower. At the same time, scalability of EPP remains unclear since no data is provided on the running time performance for different ensemble sizes.
For uk-2007-05, which was the largest graph used in the experiments, only the processing time of 120 seconds using the PLP algorithm and a parallel configuration with 32 threads is reported. No information is provided about scalability tests with this graph for other numbers of threads. Besides, due to memory constraints a different hardware platform with larger memory and a different CPU had to be used to process this network. Therefore, the results are not directly comparable to those of other datasets. Although it is also mentioned that "a modularity of 0.99598 is reached for the uk-2007-05 graph in 660 seconds", it is not clear under which conditions this result has been achieved. There is no mention of any other results concerning uk-2007-05, including any comparisons with other algorithms. Despite mentioning that uk-2007-05 requires "more than 250 GB of memory in the course of an EPP run", no EPP results for this graph are reported either.
In [20] we designed a multithreaded parallel community detection algorithm based on the sequential version of SLPA. Although only unweighted and undirected networks have been used to study the performance of our parallel SLPA implementation, an extension for the case of weighted and directed edges is straightforward and does not affect the computational complexity of the method. Since each edge is treated as undirected, an extra edge is added to the network for every edge of the network being read. Essentially, a network is made symmetrical, i.e. if there is an edge from some node i to some node j then there is also an edge from node j to node i. Every undirected edge is represented with two directed edges connecting two nodes in opposite directions. Although this doubles the number of edges that are represented internally in code when the input network is undirected, such an approach is more general and can be used for networks with directed edges as well. A distinctive feature of our parallel solution is that, unlike other approaches described above, it is capable of performing overlapping community detection and has a parameter that allows balancing the running time against the community detection quality.
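The symmetric edge representation described above can be sketched in a few lines. This is an illustration only and not the code of [20]; the function name `build_symmetric_adjacency` and the adjacency-list layout are assumptions.

```python
from collections import defaultdict


def build_symmetric_adjacency(edges):
    """Store every edge (i, j) read from the input in both directions."""
    adjacency = defaultdict(list)
    for i, j in edges:
        adjacency[i].append(j)  # edge as read
        adjacency[j].append(i)  # extra edge in the opposite direction
    # Note: duplicates are not removed if the input already lists both directions.
    return dict(adjacency)


adjacency = build_symmetric_adjacency([(1, 2), (2, 3)])
directed_edge_count = sum(len(neighbors) for neighbors in adjacency.values())
```

Two input edges yield four internally stored directed edges, which is the doubling mentioned in the text.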
In this paper we further explore the multithreaded parallel programming paradigm that was used in [20] and test its performance on several real-world networks that range in size from several hundred thousand nodes and a few million edges to almost 5.5 million nodes and close to 170 million edges. We also provide a detailed analysis of the quality of communities detected with the parallel algorithm.

PARALLEL LINEAR TIME COMMUNITY DETECTION
SLPA [16] is a sequential linear time algorithm for detecting overlapping communities. SLPA iterates over the list of nodes in the network. Each node i randomly picks one of its neighbors n_i, and the neighbor then randomly selects a label l from its list of labels and sends it to the requesting node. Node i then updates its local list of labels with l. This process is repeated for all the nodes in the network. Once it is completed, the list of nodes is shuffled and the same processing repeats again for all nodes. After t iterations of shuffling and label propagation, every node in the network has a label list of length t, as every node receives one label in each iteration. After all iterations are completed, post-processing is carried out on the list of labels and communities are extracted. We refer interested readers to the full paper [16] for more details on SLPA.
The sequence of iterations executed in SLPA makes the algorithm inherently sequential: the label lists updated in one iteration must be reflected in the subsequent iterations. Therefore, the nodes cannot be processed completely independently of each other. Each node is a neighbor of some other nodes; therefore, if the label lists of its neighbors are updated, it will receive a label randomly picked from the updated lists.

Multithreaded SLPA with Busy-waiting and Implicit Synchronization
Our multithreaded implementation closely follows the algorithm described in [20] with minor improvements and bug fixes. In the multithreaded SLPA we adopt a busy-waiting synchronization approach. Each thread performs label propagation on a subset of nodes assigned to this particular thread. This requires that the original network be partitioned into subnetworks, with one subnetwork assigned to each thread. Although partitioning can be done in several different ways depending on the objective, in this case the best partitioning is the one that makes every thread spend the same amount of time processing each node. Label propagation for any node consists of forming a list of labels by selecting a label from every neighbor of this node and then selecting a single label from this list to become a new label for this node. In other words, the ideal partitioning would guarantee that at every step of the label propagation phase each thread deals with a node that has exactly the same number of neighbors as the nodes being processed by the other threads. Thus the ideal partitioning would split the network in such a way that the sequence of nodes for every thread consists of nodes with the same number of neighbors across all the threads. Such partitioning is illustrated in Fig. 1. $T_1, T_2, \ldots, T_p$ are p threads that execute SLPA concurrently. As indicated by the arrows, time flows from top to bottom. Each thread has its subset of nodes $n_{i1}, n_{i2}, \ldots, n_{ik}$ of size k, where i is the thread number, and the numbers of node neighbors are $m_1, m_2, \ldots, m_k$. A box corresponds to one iteration. There are t iterations in total. Dashed lines denote points of synchronization between the threads.
In practice, this ideal partitioning will lose its perfection due to variations in thread start-up times as well as the uncertainty associated with thread scheduling. In other words, for this ideal scheme to work perfectly, hard synchronization of threads after processing every node would be necessary. Such synchronization would be both detrimental to performance and unnecessary in real-life applications.
Instead of trying to achieve an ideal partitioning, we can employ a much simpler approach by giving all the threads the same number of neighbors to examine in one iteration of the label propagation phase. This requires providing each thread with a subset of nodes whose indegrees sum to the same value as the indegrees of the nodes assigned to every other thread. In this case, in every iteration of the label propagation phase every thread examines the same overall number of neighbors for all the nodes assigned to it. Therefore, every thread performs roughly the same amount of work per iteration. Moreover, synchronization is then only necessary after each iteration to make sure that no thread is ahead of any other thread by more than one iteration. Fig. 2 illustrates such partitioning. As before, $T_1, T_2, \ldots, T_p$ are p threads that execute SLPA concurrently. As shown by the arrows, time flows from top to bottom. However, each thread now has its subset of nodes $n_{i1}, n_{i2}, \ldots, n_{ik_i}$ of size $k_i$, where i is the thread number. In other words, threads are allowed to process different numbers of nodes, as long as the total number of node neighbors $M = \sum_{j=1}^{k_i} m_j$ is the same across all the threads. A box still corresponds to one iteration. There are t iterations in total. Dashed lines denote points of synchronization between the threads.
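A minimal sketch of such a degree-balanced split is given below. The function name `partition_by_degree` and the greedy, order-preserving strategy are assumptions for illustration; exact equality of the degree sums is generally not achievable, so the sketch only aims for approximately equal sums.

```python
def partition_by_degree(adjacency, p):
    """Split nodes into p contiguous groups with roughly equal sums of degrees."""
    total = sum(len(adj) for adj in adjacency.values())
    parts = [[] for _ in range(p)]
    running = 0  # sum of degrees of the nodes placed so far
    for node, adj in adjacency.items():
        # Index of the thread whose share of the total degree sum we are filling.
        k = min(p - 1, running * p // total) if total else 0
        parts[k].append(node)
        running += len(adj)
    return parts

# A star: the hub has degree 4 and each leaf has degree 1.
adjacency = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
parts = partition_by_degree(adjacency, 2)
degree_sums = [sum(len(adjacency[v]) for v in part) for part in parts]
```

In this toy example one thread receives only the hub and the other receives the four leaves, so the node counts differ while the number of neighbors examined per iteration is the same.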
We can employ an even simpler approach and just split the nodes between the threads so that every thread gets the same (or nearly the same) number of nodes. This approach is based on the premise that the local average node degree varies little across all possible subsets of nodes of equal size. If this condition is met, then, as in the previous case, every thread performs approximately the same amount of work per iteration. Our experiments show that this condition holds for many real-world networks, and we adopted this simple partitioning scheme for our multithreaded SLPA implementation.
Given the choice of the partitioning methods described above, each of the threads running concurrently is processing all the nodes in its subset of nodes at every iteration of the algorithm. Before each iteration, the whole subset of nodes processed by a particular thread needs to be shuffled in order to make sure that the label propagation process is not biased by any particular order of processing nodes. Moreover, to guarantee the correctness of the algorithm, it is necessary to ensure that no thread is more than one iteration ahead of any other thread. The latter condition places certain restriction on the way threads are synchronized. More specifically, if a particular thread is running faster than the others (whatever the reasons for this might be) it has to eventually pause to allow other threads to catch up (i.e. to arrive at a synchronization point no later than one iteration behind this thread). This synchronization constraint limits the degree of concurrency of this multithreaded solution.
The way network nodes are partitioned into subsets processed by the threads matters because of how edges are distributed across different network segments. In our implementation we use a very simple method of forming subsets of nodes for individual threads. First, a subset for the first thread is formed. Nodes are read sequentially from an input file. As soon as a new node is encountered, it is added to the subset of nodes processed by the first thread. After the subset of nodes for the first thread has been filled, a subset of nodes for the second thread is formed, and so on. Simple and natural as it is, this approach works well on networks with high locality of edges. For such networks, if the input file is sorted in the order of node numbers, nodes are more likely to have edges to other nodes that are assigned to the same thread. This leads to a partitioning in which only a small fraction (a few percent) of the nodes processed by each thread have neighbors processed by other threads.
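The reading-order partitioning, together with a check of how many nodes end up with neighbors in other subsets, might look like the sketch below. The function names are hypothetical, and the sketch collects all nodes before cutting the list into equal pieces, which is an assumption about how subset sizes are determined.

```python
def partition_in_reading_order(edges, p):
    """Assign nodes to p subsets in the order in which they are first encountered."""
    order, seen = [], set()
    for i, j in edges:
        for node in (i, j):
            if node not in seen:
                seen.add(node)
                order.append(node)
    n = len(order)
    bounds = [k * n // p for k in range(p + 1)]  # nearly equal subset sizes
    return [order[bounds[k]:bounds[k + 1]] for k in range(p)]

def external_fraction(adjacency, parts):
    """Per subset, the fraction of nodes with at least one neighbor in another subset."""
    owner = {node: k for k, part in enumerate(parts) for node in part}
    fractions = []
    for k, part in enumerate(parts):
        external = sum(1 for v in part if any(owner[u] != k for u in adjacency[v]))
        fractions.append(external / len(part) if part else 0.0)
    return fractions

# A path 0-1-2-3-4-5 read in node order has high edge locality.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
parts = partition_in_reading_order(edges, 2)
adjacency = {}
for i, j in edges:
    adjacency.setdefault(i, []).append(j)
    adjacency.setdefault(j, []).append(i)
fractions = external_fraction(adjacency, parts)
```

On this path only the two nodes adjacent to the cut have neighbors in the other subset; on a large network with similar locality the corresponding fraction would be much smaller.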
Algorithm 1 shows the label propagation phase of our multithreaded SLPA algorithm, which is executed by each thread. First, each thread receives the subset of nodes that it processes, called ThreadNodesPartition. An array of dependencies Used is first initialized and then filled in such a way that it contains 1 for every thread that processes at least one neighbor of a node from ThreadNodesPartition and 0 otherwise. This array of dependencies Used is then transformed into a more compact representation in the form of a dependency array D. An element of array D contains the number of a thread that processes some neighbor of a node that this thread processes. Dsize is the size of array D. If no node that belongs to the subset processed by this thread has neighbors processed by other threads, then array D is empty and Dsize = 0. If, for example, nodes that belong to the subset processed by this thread have neighbors processed by threads 1, 4, and 7, then array D has three elements with values of 1, 4, and 7, and Dsize = 3. After the dependency array has been filled, the execution flow enters the main label propagation loop, which is controlled by counter t and has maxT iterations. At the beginning of every iteration we ensure that this thread is not ahead of the threads on which it depends by more than one iteration. If it turns out that it is ahead, this thread has to wait for the other threads to catch up. Then the thread performs a label propagation step for every node v in ThreadNodesPartition, adding the selected label l to the labels of v, and increments its iteration counter t.

To further alleviate the synchronization burden between the threads and to minimize the sequential portion of their execution, another optimization technique can be used. We note that some nodes in the set processed by a particular thread are connected only to nodes processed by the same thread (we call them internal nodes), while other nodes have external dependencies.
We say that a node has an external dependency when at least one of its neighbors belongs to a subset of nodes processed by some other thread. Because of nodes with external dependencies, synchronization rules described above must be strictly followed in order to ensure correctness of the algorithm and meaningfulness of the communities it outputs. However nodes with no external dependencies can be processed within a certain iteration independently from the nodes with external dependencies. It should be noted that a node with no external dependencies is not completely independent from the rest of the network since it may well have neighbors of neighbors that are processed by other threads.
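The dependency array and the distinction between nodes with and without external dependencies can be sketched as follows. The function names are hypothetical, and excluding a thread from its own dependency array is an assumption of this sketch (waiting on itself would be harmless, since the difference of counters would be zero).

```python
def dependency_array(k, parts, adjacency):
    """Return D for thread k: the threads that process neighbors of thread k's nodes."""
    owner = {node: idx for idx, part in enumerate(parts) for node in part}
    used = [0] * len(parts)          # the Used array: one flag per thread
    for v in parts[k]:
        for u in adjacency[v]:
            used[owner[u]] = 1
    used[k] = 0                      # assumption: a thread does not depend on itself
    return [idx for idx, flag in enumerate(used) if flag]  # compact form D

def split_external_internal(k, parts, adjacency):
    """Separate thread k's nodes into those with and without external dependencies."""
    owner = {node: idx for idx, part in enumerate(parts) for node in part}
    external, internal = [], []
    for v in parts[k]:
        if any(owner[u] != k for u in adjacency[v]):
            external.append(v)
        else:
            internal.append(v)
    return external, internal

# A path 0-1-2-3-4-5 split between three threads.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
parts = [[0, 1], [2, 3], [4, 5]]
d_middle = dependency_array(1, parts, adjacency)
d_first = dependency_array(0, parts, adjacency)
external_first, internal_first = split_external_internal(0, parts, adjacency)
```

Here the middle thread depends on both of its neighbors, while the first thread depends only on the middle one and has a single node with an external dependency.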
It follows that processing of nodes with no external dependencies has to be done within the same iteration framework as for nodes with external dependencies but with less restrictive relations in respect to the nodes processed by other threads. In order to utilize the full potential of the technique described above, it is necessary to split the subset of nodes processed by a thread into two subsets, one of which contains only nodes with no external dependencies and the other one contains all the remaining nodes. Then during the label propagation phase of the SLPA nodes that have external dependencies are processed first in each iteration. Since we know that by the time such nodes are processed the remaining nodes (ones with no external dependencies) cannot influence the labels propagated to nodes processed by other threads (due to the symmetry of the network) it is safe to increment the iteration counter for this thread, thus allowing other threads to continue their iterations if they have been waiting for this thread in order to be able to continue. Meanwhile this thread can finish processing nodes with no external dependencies and complete the current iteration.
This approach effectively allows a thread to report completion of an iteration to the other threads before the iteration has in fact been completed, relying on the fact that the remaining work cannot influence nodes processed by other threads. Though seemingly simple and intuitive, this approach leads to a noticeable improvement in the efficiency of parallel execution (as described in Section 3.2), mainly because signaling other threads earlier than would happen without such splitting reduces the sequential portion of multithreaded execution.
An important peculiarity arises when the number of nodes with external dependencies is only a small fraction (a few percent) of all the nodes processed by the thread. In this case it is beneficial to add some nodes without external dependencies to the nodes with external dependencies and process them together before incrementing the iteration counter. The motivation is that the nodes in each of the two partitions must be shuffled separately in order to preserve the order of execution between the partitions. Increasing the size of the first partition above the number of external nodes improves shuffling in the smaller of the two partitions.
The remaining nodes without external dependencies can be processed after incrementing the iteration counter, as before. In order to reflect this optimization factor we introduce an additional parameter called the splitting ratio. A value of this parameter indicates the percentage of nodes processed by the thread before incrementing the iteration counter. For instance, if we say that splitting of 0.2 is used it means that at least 20% of nodes are processed before incrementing the iteration counter. If after initial splitting of nodes into two subsets of nodes with external dependencies and without external dependencies it turns out that there are too few nodes with external dependencies to satisfy the splitting ratio, some nodes that have no external dependencies are added to the group of nodes with external dependencies just to bring the splitting ratio to the desired value.
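The effect of the splitting ratio can be illustrated with a short sketch. The function name `apply_splitting_ratio` is hypothetical, and which internal nodes get moved is not specified in the text; taking the first ones is an assumption.

```python
import math

def apply_splitting_ratio(external, internal, ratio):
    """Return (nodes processed before the counter increment, nodes processed after)."""
    total = len(external) + len(internal)
    required = math.ceil(ratio * total)            # minimum size of the first group
    shortfall = max(0, required - len(external))   # internal nodes to move over
    first = external + internal[:shortfall]        # assumption: move the first ones
    rest = internal[shortfall:]
    return first, rest

# One node with external dependencies out of ten, splitting ratio 0.2.
external = [7]
internal = [0, 1, 2, 3, 4, 5, 6, 8, 9]
first, rest = apply_splitting_ratio(external, internal, 0.2)
```

If the nodes with external dependencies already make up at least the requested share, nothing is moved and the split is left as it is.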
Algorithm 2 shows our multithreaded SLPA algorithm that implements splitting of nodes processed by a thread into a subset of nodes with external dependencies and a subset with no external dependencies. The major difference from Algorithm 1 is that instead of processing all the nodes before incrementing the iteration counter, we first process a subset of nodes that includes nodes that have neighbors processed by other threads, then we increment the iteration counter, and then we process the rest of the nodes. Since in [20] we studied the impact of selecting different values of the splitting ratio, it wasn't our main focus here. We simply accepted a splitting ratio of 0.2, and kept it fixed for all the test runs. Our major objective was to ensure that all parallel and sequential runs are performed with exactly the same code base and provide identical runtime conditions and parameters, so that results of our performance evaluation and community detection quality metrics are directly comparable.

Algorithm 2 : Multithreaded SLPA with splitting of nodes
Internal ← CreateInternalPartition(InputFile)
External ← CreateExternalPartition(InputFile)
p ← number of threads
/* Unchanged code from Algorithm 1 omitted */
while t < maxT do
    for j = 0 to j < Dsize − 1 do
        while t − t of thread D[j] > 1 do
            Do nothing
        end while
    end for
    for all v such that v is in External do
        l ← selectLabel(v)
        Add label l to labels of v
    end for
    t ← t + 1
    for all v such that v is in Internal do
        l ← selectLabel(v)
        Add label l to labels of v
    end for
end while
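For readers who prefer executable code, the control flow of Algorithm 2 can be mimicked with Python threads. This is a sketch under several assumptions and not the Pthreads implementation: the function name `run_split_slpa` is hypothetical, nodes are assumed to start with their own id as a label and to have at least one neighbor, `select_label` keeps the most frequent of the labels received from all neighbors (the text does not state how the single label is chosen), and the busy-wait yields with `time.sleep(0)`.

```python
import random
import threading
import time
from collections import Counter

def run_split_slpa(adjacency, external, internal, deps, max_t, seed=0):
    p = len(external)
    labels = {v: [v] for v in adjacency}  # assumption: initial label is the node id
    counters = [0] * p                    # shared iteration counters, one per thread

    def select_label(v, rng):
        # One random label from every neighbor; keep the most frequent one.
        received = [rng.choice(labels[u]) for u in adjacency[v]]
        return Counter(received).most_common(1)[0][0]

    def propagate(nodes, rng):
        rng.shuffle(nodes)  # each group is shuffled separately
        for v in nodes:
            labels[v].append(select_label(v, rng))

    def worker(k):
        rng = random.Random(seed + k)
        ext, inte = list(external[k]), list(internal[k])
        while counters[k] < max_t:
            for j in deps[k]:
                while counters[k] - counters[j] > 1:
                    time.sleep(0)  # busy-wait until thread j catches up
            propagate(ext, rng)    # nodes with external dependencies first
            counters[k] += 1       # report the iteration as done early
            propagate(inte, rng)   # then the remaining internal nodes

    threads = [threading.Thread(target=worker, args=(k,)) for k in range(p)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return labels

# A ring of 8 nodes split between two threads.
adjacency = {v: [(v - 1) % 8, (v + 1) % 8] for v in range(8)}
external = [[0, 3], [4, 7]]
internal = [[1, 2], [5, 6]]
deps = [[1], [0]]
labels = run_split_slpa(adjacency, external, internal, deps, max_t=20)
```

The thread with the smallest counter never waits, so some thread can always make progress and the loop terminates; each node's label list is only appended to by the thread that owns the node.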

Performance Evaluation of the Multithreaded Solution
We performed runs on a hyper-threaded Linux system running on a Silicon Mechanics Rackform nServ A422.v3 machine (GANXIS.nest.rpi.edu). Processing power was provided by 64 cores organized as four AMD Opteron™ 6272 central processing units.

Four datasets have been used to test the performance of the multithreaded solution and the quality of community detection. Three of these datasets (com-Amazon, com-DBLP, and com-LiveJournal) have been acquired from the Stanford Large Network Dataset Collection (http://snap.stanford.edu/data), which contains a selection of publicly available real-world networks (SNAP networks).
The undirected Amazon product co-purchasing network (referred to as com-Amazon) was gathered, described, and analyzed in [21]. According to the dataset information [22], it was collected by crawling the Amazon website. The Customers Who Bought This Item Also Bought feature of the Amazon website was used to build the network. If some product i is frequently bought together with product j, then the network contains an undirected edge from i to j. For each product category defined by Amazon there is a corresponding ground truth community. Each connected component in a product category is treated as a separate ground truth community.
Since small ground truth communities having less than 3 nodes have been removed, it was necessary to modify the original com-Amazon network to ensure that only nodes that belong to ground truth communities can appear in communities detected by the multithreaded parallel algorithm. Otherwise, comparison of communities produced by the community detection algorithm and the ground truth communities would not be feasible. The modified com-Amazon network was obtained from the original one by removing nodes which are not found in any ground truth community and all the edges connected to those nodes. While the original Amazon network consists of 334,863 nodes and 925,872 undirected edges, the modified dataset has 319,948 nodes and 1,760,430 directed edges. As outlined in Section 2, each undirected edge is internally converted to a pair of edges. Therefore, 925,872 undirected edges from the original network correspond to 1,851,744 directed edges in the internal representation of the code, and since some of the edges were incident to removed nodes, the resulting number of directed edges left in the network was 1,760,430.
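The filtering step described above can be sketched in a few lines. The function name `restrict_to_ground_truth` and the in-memory representation of edges and communities are assumptions for illustration only.

```python
def restrict_to_ground_truth(edges, communities, min_size=3):
    """Keep only nodes found in a sufficiently large ground truth community."""
    kept_communities = [set(c) for c in communities if len(c) >= min_size]
    kept_nodes = set().union(*kept_communities)
    # An edge survives only if both of its endpoints survive.
    kept_edges = [(i, j) for i, j in edges if i in kept_nodes and j in kept_nodes]
    return kept_nodes, kept_edges

edges = [(1, 2), (2, 3), (3, 4), (4, 5)]
communities = [[1, 2, 3], [4, 5]]  # the second one is too small and is dropped
kept_nodes, kept_edges = restrict_to_ground_truth(edges, communities)
directed_edge_count = 2 * len(kept_edges)  # each undirected edge is stored twice
```

The same doubling explains the figures quoted for com-Amazon: 925,872 undirected edges correspond to 2 × 925,872 = 1,851,744 directed edges before any nodes are removed.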
The DBLP computer science bibliography network (referred to as com-DBLP) was also introduced and studied in [21]. According to the dataset information [23], it provides a comprehensive list of research papers in computer science. If two authors publish at least one paper together, then the nodes corresponding to these authors will be connected with an edge in a co-authorship network. Ground truth communities are based on authors who published in journals or conferences. All authors who have at least one publication in a particular journal or conference form a community. Similarly to the com-Amazon network, each connected component in a group is treated as a separate ground truth community. Small ground truth communities (less than 3 nodes) have also been removed.
The DBLP dataset was also modified to facilitate comparison with ground truth communities as described above for the com-Amazon network. Since DBLP is also undirected, the same considerations about the number of edges that were provided above for the com-Amazon network also apply to com-DBLP. The original DBLP network contains 317,080 nodes and 1,049,866 undirected edges, while the modified version has 260,998 nodes and 1,900,118 directed edges.
Another network from [21] that we use to evaluate the performance of the multithreaded parallel implementation of SLPA and the quality of the communities it produces is the LiveJournal dataset (referred to as com-LiveJournal). The dataset information page [24] describes LiveJournal as a free on-line blogging community where users declare friendship with each other. LiveJournal users can form groups and allow other members to join them. For the purposes of evaluating the quality of communities we treat the com-LiveJournal network as having no ground truth communities. The LiveJournal network is undirected and contains 3,997,962 nodes and 34,681,189 pairs of directed edges. Since we are not comparing the communities found by the community detection algorithm with the ground truth communities, no modification of the original network is necessary.
The fourth dataset is a snapshot of the Foursquare network as of October 11, 2013. This dataset contains 5,499,157 nodes and 169,687,676 edges. There is no information about ground truth communities available.
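We calculated speedup using the formula shown in (1) and efficiency according to (2):

$$\mathrm{Speedup} = \frac{T_1}{T_p} \qquad (1)$$

$$\mathrm{Efficiency} = \frac{\mathrm{Speedup}}{p} \qquad (2)$$

where $T_1$ is the running time on a single core, $T_p$ is the running time on p cores, Speedup is the actual speedup calculated according to Equation 1, and p is the number of processors or computing cores.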
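As a small worked example, assuming the usual definitions (speedup as the ratio of the one-core running time to the p-core running time, and efficiency as speedup divided by p), the two quantities can be computed as below. The timings are hypothetical and are not measurements from our runs.

```python
def speedup(time_one_core, time_p_cores):
    """Ratio of the single-core running time to the running time on p cores."""
    return time_one_core / time_p_cores

def efficiency(time_one_core, time_p_cores, p):
    """Speedup divided by the number of cores used."""
    return speedup(time_one_core, time_p_cores) / p

# Hypothetical timings in hours: 48 on one core, 8 on 16 cores.
s = speedup(48.0, 8.0)
e = efficiency(48.0, 8.0, 16)
```

With these made-up numbers the speedup is 6 while the efficiency is only 0.375, which shows how efficiency can drop even when the absolute running time improves substantially.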
All the experiments were run with 1,000 iterations (the value of maxT was set to 1000) for all networks. On one hand, 1,000 iterations provide a sufficient amount of work for the parallel portion of the algorithm, so that the overhead associated with creating and launching multiple threads does not dominate the label propagation running time. On the other hand, 1,000 iterations are empirically enough to produce meaningful communities, since the number of labels in the history of every node is statistically significant. At the same time, although running the algorithm for 1,000 iterations on certain datasets (especially the larger ones) took a few days in some cases (mainly for smaller core counts), it was still feasible to complete all runs on all four networks in under two weeks.
We conducted one set of measurements by considering only the time of the label propagation phase, since it is this stage that differs in our multithreaded implementation from the original sequential version. The time necessary to read an input file and construct the in-memory representation of the nodes and edges, as well as any auxiliary data structures, was not included in this timing. All post-processing steps and the writing of output files have also been excluded.
However, for an end user it is not the label propagation time (or any other single phase of the algorithm) that is important but rather the total running time. Users care about the time it took for the code to run: from the moment a command was issued till the resulting communities files have been written to a disk. Therefore, we conducted a second set of measurements to gather data on total execution time of our multithreaded parallel SLPA implementation. Since the total execution time includes not only a highly parallel label propagation stage but also file I/O, threads creation and cleanup, and other operations which are inherently sequential, it is to be expected that the values of both speedup and efficiency are going to be worse than in the case when only label propagation phase was considered.
Since the hardware platform we used provides 64 cores, every thread in our tests executes on its dedicated core. Therefore threads do not compete for central processing unit (CPU) cores (unless there is interference from the operating system or other user processes running concurrently). They execute in parallel, and we can completely ignore thread scheduling issues in our considerations. Because of this we use terms 'thread' and 'core' interchangeably when we describe results of running the multithreaded SLPA. The number of cores in our runs varies from 1 to 64. However, we observed a performance degradation for a number of threads larger than 32. This performance penalty is most likely caused by the memory banks organization of our machine. Speedup and efficiency are calculated using (1) and (2) defined earlier. No third-party libraries or frameworks have been used to set up and manage threads. Our implementation relies on Pthreads application programming interface (POSIX threads) which has implementations across a wide range of platforms and operating systems.
We noticed that the compiler version and compilation flags can play a crucial role not only in how efficiently the code runs but in whether the code is able to execute in multithreaded mode at all. Unfortunately, compiler documentation states little, if anything, clearly and unambiguously about the implications of using various compiler flags to generate code for execution on multithreaded architectures. For the most part, developers have to rely on their own experience or common sense and experiment with different flags to determine the set of options that makes the compiler generate effective code capable of flawlessly executing multiple threads.
For instance, when the compiler is run with either the -O2 or the -O3 optimization flag to compile the multithreaded SLPA, the resulting binary simply deadlocks at execution. The reason for the deadlock is precisely the optimization that the compiler performs without regard to the fact that the code is multithreaded. This optimization leaves threads unable to see updates to the shared data structures performed by other threads. In our case such a shared data structure is the array of iteration counters for all the threads. Evidently, not being able to see the updated values of other threads' counters quickly leads threads to a deadlock.
Another word of caution should be offered regarding some of the debugging and profiling compiler flags. More specifically, compiling code with -pg flag which generates extra code for a profiling tool gprof leads to substantial overhead when the code is executed in a multithreaded manner. The code seems to be executing fine but with a speedup of less than 1. In other words, the more threads are used the longer it takes for the code to run regardless of the fact that each thread is executed on its own core and therefore does not compete with other threads for CPU and that the more threads are used the smaller is a subset of nodes that each thread processes.
The results of performance runs of our multithreaded parallel implementation are presented in Figures 3-19 below. (Data export was performed using Daniel's XL Toolbox add-in for Excel, version 6.51, developed by Daniel Kraus, Würzburg, Germany.)

Figures 3, 5, 7, and 9 show the label propagation time of the multithreaded parallel SLPA on the four datasets (com-Amazon, com-DBLP, com-LiveJournal, and Foursquare, respectively) for the number of cores varying from 1 to 64. It can be seen that for smaller core counts the time decreases nearly linearly with the number of threads. For larger numbers of cores the label propagation time continues to improve but at a much slower rate. In fact, for 32 cores and more, there is almost no improvement of the label propagation time on the smaller datasets (com-Amazon and com-DBLP). At the same time, the larger datasets (com-LiveJournal and Foursquare) improve label propagation times all the way through 64 cores. As outlined in Section 1, this is to be expected, since in a strong scaling setting enough workload should be supplied to parallel processes to compensate for the overhead of creating multiple threads and maintaining communication between them.

This trend is even more evident in Figures 4, 6, 8, and 10, which plot the values of speedup and efficiency for the four datasets (com-Amazon, com-DBLP, com-LiveJournal, and Foursquare, respectively) and the number of cores from 1 to 64. As the number of cores increases, the speedup also grows, but not as fast as the number of utilized cores, so efficiency drops. The saturation of speedup is quite evident for the smaller networks (com-Amazon and com-DBLP) and corresponds to the regions with no improvement of the label propagation time that we noticed earlier. Similarly, the values of speedup continue to improve (although at decreasing rates) for the larger datasets (com-LiveJournal and Foursquare) even at 64 cores. Nonetheless, the efficiency degrades since speedup gains are small relative to the increase in core count. Such behavior can be attributed to several factors.
First of all, as the number of cores grows while the network (and hence the number of nodes and edges) stays the same, each thread gets fewer nodes and edges to process. In the limit, the overhead of creating and running threads can outweigh the benefits of parallel execution for a sufficiently small number of nodes. Furthermore, as the number of cores grows, the number of neighbors of nodes with external dependencies increases (both because each thread gets fewer nodes and because there are more threads to execute them). More nodes with external dependencies, in turn, mean that threads are more dependent on other threads.
However, for the sake of fair data interpretation it should be noted that the definition of efficiency we are using here is based on Equation 2. It only takes into account the parallel execution speedup observed on a certain number of cores. The cost of cores is completely ignored in this formula. More realistically, the cost should be considered as well. The price paid for a modern computer system is not linear in the number of processors and cores. Within a certain range of the number of cores per system, as the architecture moves towards higher processor and core counts, each additional core costs less. That is why the pure parallel efficiency defined by Equation 2 should effectively be multiplied by the cost factor when making decisions regarding the choice of hardware to run community detection algorithms on real-life networks. After such multiplication, the efficiency including cost is going to be much more favorable to higher core counts than the efficiency given by Equation 2.

Figure 11 combines plots of speedup values based on the label propagation time for all four datasets. Overall, the values of speedup do not vary considerably between the networks used in the experiments. However, it is quite evident that the shapes of the curves are slightly different. On one hand, there are com-Amazon and com-DBLP, for which the values of speedup reach a local maximum at fewer than the maximal number of cores. On the other hand, speedup values for com-LiveJournal and Foursquare are strictly increasing as the number of cores ranges between 1 and 64. This observation is further evidence of the behavior discussed earlier. Smaller networks are too small to effectively use large core counts, which leads to the saturation of speedup. The performance of multithreaded parallel SLPA on larger datasets continues to improve at an almost constant rate over a wide range of core counts between 4 and 64.
It is also worth noting that as long as a network is large enough to justify the overhead of multithreaded execution, different datasets yield almost identical speedup values. Although more testing would be required to firmly assert that speedup is independent of the size of the dataset, such behavior would be easy to explain. Indeed, the speedup of the algorithm depends primarily on the properties of the graph (e.g. the number of edges crossing the boundary between the node sets processed by different cores) rather than on the size of the network. Such a feature is quite desirable in community detection since it enables the application to provide a user with an estimate of the overall execution time once the network is loaded and partitioned between the cores.

Figures 12, 14, 16, and 18 present the total community detection time of the multithreaded parallel SLPA on the four datasets (com-Amazon, com-DBLP, com-LiveJournal, and Foursquare, respectively) for the number of cores varying from 1 to 64. Although the total running time clearly exceeds the time of the label propagation phase, the difference in many cases is not that significant. This is especially true for the larger datasets (com-LiveJournal and Foursquare), which, as we discussed above, is to be expected. The fact that the label propagation phase is the dominating component of the total running time justifies our efforts to increase performance by replacing sequential label propagation with a parallel implementation.
The values of speedup and efficiency calculated based on the total execution time rather than the label propagation time are plotted in Figures 13, 15, 17, and 19 for the four datasets (com-Amazon, com-DBLP, com-LiveJournal, and Foursquare, respectively) and the number of cores between 1 and 64. Although these values are worse than those calculated based only on the label propagation time, they provide a more realistic view of the end-to-end performance of our multithreaded SLPA implementation. In real life, speedup values of around 5 to 6 still constitute a substantial improvement over the sequential implementation, meaning, for example, that a user would wait only 8 hours for community detection results instead of 2 days.

Figure 20 shows combined plots of speedup values for all four datasets considering the total execution time. Just like in Figure 11, the values of speedup for different networks are quite similar. The same two types of curves are observed, which correspond to a group of relatively small networks (com-Amazon and com-DBLP) and a group of large networks (com-LiveJournal and Foursquare).
However, there are some differences. First, the absolute values of speedup are lower when we consider the total execution time instead of just the label propagation phase. This is to be expected since the total execution time includes many operations which cannot be made efficiently parallel (reading the input graph and writing output communities, partitioning the network between the cores, etc.). Second, the difference in speedup between datasets, even within the same group (e.g. large datasets), is greater than it was in Figure 11. The reason for that is also the limiting effect of sequential operations. Since we are considering the total execution time here, the size of the dataset affects speedup more significantly than in the case when only the label propagation time was taken into account.
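The limiting effect of sequential operations can be made concrete with Amdahl's law, a standard model that is not named in the text but formalizes the same argument. In the sketch below the serial fraction of 0.1 is a hypothetical value chosen for illustration, not a quantity measured in our experiments.

```python
def amdahl_speedup(serial_fraction, p):
    """Upper bound on speedup when a fixed fraction of the work stays sequential."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

# Hypothetical case: 10% of the total running time cannot be parallelized.
bounds = {p: amdahl_speedup(0.1, p) for p in (1, 4, 16, 64)}
```

Under this assumption the speedup can never exceed 1 / 0.1 = 10 regardless of the core count, which is consistent in spirit with total-time speedups being lower than those of the label propagation phase alone.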

THE QUALITY OF COMMUNITY DETECTION
In this part, we evaluate the quality of the community structure detected with the multithreaded version of SLPA [16] on the four datasets, Amazon, DBLP, Foursquare, and LiveJournal, introduced in Section 3.2. Amazon and DBLP have ground truth communities, while Foursquare and LiveJournal do not. Our only concern here is whether the community structure discovered by multithreaded SLPA has quality similar to that detected by sequential SLPA [16], since we have already shown the effectiveness of sequential SLPA, compared with other community detection algorithms, in [16] and [25]. Each metric value in the following tables is the average of ten community detection results obtained with the SLPA threshold r set to 0.01, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, and 0.45.
We calculate Variation of Information (VI), Normalized Mutual Information (NMI), F-measure, and Normalized Van Dongen metric (NVD) [19] of the community structures detected by sequential SLPA and multithreaded SLPA on Amazon and DBLP, presented in Table 1 and Table 2. Notice that VI, NMI, F-measure, and NVD are intended to measure the quality of disjoint communities. However, we can still use them here to evaluate the quality of overlapping communities, although the values of NMI, F-measure, and NVD may then fall outside the range [0, 1]. There are mainly two reasons why we adopt their disjoint versions. On one hand, we are only concerned with whether multithreaded SLPA has almost the same performance as sequential SLPA [...] It can be seen from Table 1 and Table 2 that the metric values of the community structures detected by sequential SLPA and multithreaded SLPA on Amazon and DBLP are very close to each other, which indicates that multithreaded SLPA has almost the same performance as sequential SLPA on Amazon and DBLP.
We then compute modularity (Q) [18], Intra-density, Contraction, Expansion, and Conductance [21], [27] of the community structures found by sequential SLPA and multithreaded SLPA on Foursquare and LiveJournal, presented in Table 3 and Table 4. Notice that the modularity we adopt here is also the version intended for disjoint communities, so its value may not be in the range [0, 1]. The reasons why we use the disjoint version of modularity are similar to those given for VI, NMI, F-measure, and NVD. In addition, there are several overlapping versions of modularity [28], [29], [30], [31], [32], [33]; however, it is unclear which one is the best. Table 3 and Table 4 show that the metric values of the community structures detected by sequential SLPA and multithreaded SLPA on Foursquare and LiveJournal are also very close to each other, which implies that multithreaded SLPA has almost the same performance as sequential SLPA on Foursquare and LiveJournal.
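As an illustration of two of the metrics named above, the sketch below computes VI and NMI for a pair of disjoint partitions, each given as one community label per node. It assumes natural logarithms and one common normalization of NMI (twice the mutual information divided by the sum of the two entropies); the exact formulations used for the tables are those of [19] and may differ. The function names and toy partitions are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in nats) of the community-size distribution."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Mutual information between two partitions of the same node set."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (n_xy / n) * math.log(n_xy * n / (count_a[x] * count_b[y]))
        for (x, y), n_xy in joint.items()
    )


def variation_of_information(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """VI = H(A) + H(B) - 2 I(A; B); zero for identical partitions."""
    return entropy(a) + entropy(b) - 2.0 * mutual_information(a, b)


def normalized_mutual_information(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """NMI = 2 I(A; B) / (H(A) + H(B)); defined as 1.0 if both entropies are zero."""
    denominator = entropy(a) + entropy(b)
    return 1.0 if denominator == 0 else 2.0 * mutual_information(a, b) / denominator


# Hypothetical partitions of four nodes.
sequential = [0, 0, 1, 1]
relabelled = [5, 5, 7, 7]   # same grouping, different label names
unrelated = [0, 1, 0, 1]    # statistically independent of `sequential`

print(variation_of_information(sequential, relabelled),
      normalized_mutual_information(sequential, relabelled))
print(variation_of_information(sequential, unrelated),
      normalized_mutual_information(sequential, unrelated))
```

Because this sketch assigns exactly one label to every node, its values stay within the usual bounds; applying the disjoint formulas to overlapping covers, as done for the tables, is what allows values outside [0, 1].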
Comparisons between different community detection algorithms are not always easy to make due to substantially different implementations which might require even mutually exclusive architectural features or software components (shared vs. distributed memory machines, programming languages compiled to native code vs. development systems based on virtual machines or interpretation, and so on).
It is also important to consider the type of output communities produced by an algorithm. As mentioned earlier, overlapping community detection is more computationally intensive than disjoint community detection. While the majority of other parallel solutions perform only disjoint community detection, our multithreaded SLPA can produce either disjoint or overlapping communities, depending on the value of threshold r.
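The role of the threshold r can be sketched as a post-processing step applied to the labels a node has accumulated during propagation: labels whose relative frequency falls below r are discarded, so a small r lets a node keep several labels (overlapping communities), while a larger r leaves at most one. The sketch below is a simplified illustration and not our implementation; the function name, the toy label memory, and the fallback to the most frequent label when nothing reaches r are assumptions made here.

```python
from collections import Counter
from typing import List, Set


def threshold_labels(memory: List[int], r: float) -> Set[int]:
    """Keep the labels whose relative frequency in a node's memory is at least r."""
    counts = Counter(memory)
    total = len(memory)
    kept = {label for label, count in counts.items() if count / total >= r}
    if not kept:
        # Assumption of this sketch: never leave a node without a community.
        kept = {counts.most_common(1)[0][0]}
    return kept


# Hypothetical label memory of one node: label 1 seen 6 times, 2 three times, 3 once.
memory = [1, 1, 1, 1, 1, 1, 2, 2, 2, 3]

for r in (0.05, 0.2, 0.45):
    print(r, sorted(threshold_labels(memory, r)))
```

In this toy case the node belongs to three communities at r = 0.05, to two at r = 0.2, and to a single one at r = 0.45.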
Even though execution time is certainly one of the most important performance measures for an end user, it is often not suitable for direct comparisons between different implementations of community detection methods. Unlike execution time which depends on specific hardware, operating systems, code execution environments, compiler optimizations, and other factors, speedup evens out architectural and algorithmic differences. It is therefore a much better way to compare runtime performance of community detection algorithms.
Another important factor that makes it hard to compare the results produced by competing methods is the use of different datasets. Although several datasets seem to appear more often than others (e.g. Amazon, DBLP, LFR, etc.), there is no established set of datasets that is publicly available and widely accepted as a benchmark for high performance community detection. If such a benchmark existed, it should contain a balanced blend of both real-world and synthetic datasets of varying size (from hundreds of thousands of nodes and edges to billion-scale networks), carefully selected so that it does not give an a priori advantage to any of the possible approaches to community detection.
There are datasets which are supplied with so-called ground truth communities, although in some cases it is very questionable whether these communities in fact represent the ground truth. For other networks it is not feasible to establish the ground truth at all. Again, there is no established consensus on whether datasets with or without ground truth communities (or a combination of both types) should be evaluated. Different researchers approach this problem differently, mainly depending on the datasets they have access to. There is also the problem of proprietary datasets, which might not be available to other researchers who wish to test their community detection implementations.
Besides using different datasets, researchers also use different metrics to evaluate the quality of community detection. A decade or so ago, modularity was the dominating player in the community quality field. However, after it was discovered that the original formulation of modularity suffers from several drawbacks, a number of new or extended metrics have been proposed and a number of old, almost forgotten methods have been rediscovered. A detailed review of different existing and emerging metrics can be found in [19]. Still, there is no agreement on which metric (or combination thereof) should be sufficient to quantify the quality of community detection performed by a certain algorithm.
From all of the above it follows that performing fair, "apples to apples" comparisons of different community detection implementations is hard. To take just one example, let us consider PLP/EPP, SCD, and multithreaded SLPA:
• Both PLP/EPP and SCD methods (see Section 2) are only able to detect disjoint communities while multithreaded SLPA performs overlapping community detection.
• Experiments with SCD were only conducted with the number of threads ranging from 1 to 4. In contrast, in our approach described in Section 3, we evaluate the method and show its scalability for all datasets being tested, including large graphs, and the number of cores ranging from 1 to 64. PLP was tested for a slightly wider range of parallel configurations (1 to 16 threads) but only for one dataset, uk-2002. For the Foursquare network, which has a size similar to that of uk-2002, the values of speedup demonstrated by multithreaded SLPA (see Figure 19) are comparable to the results of SCD described in Section 2.
• Modularity is the only measure of community quality considered by PLP/EPP. SCD uses NMI and average F1 score. Multithreaded SLPA uses several different metrics, including NMI and F-measure. However, for the reasons explained above, the values of NMI and F-measure may not be in the conventional range of [0, 1]. Therefore, it is not feasible to directly compare the values of community quality metrics obtained in our experiments with the SCD results.
In conclusion, the community structure found by multithreaded SLPA has almost the same quality as that discovered by sequential SLPA. Moreover, we have demonstrated in [16] and [25] that sequential SLPA is very effective compared with other community detection algorithms, which implies the effectiveness of multithreaded SLPA for community detection.

CONCLUSION AND FUTURE WORK
In this paper, we evaluated the performance of a multithreaded parallel implementation of SLPA and showed that using modern multiprocessor and multicore architectures can significantly reduce the time required to analyze the structure of different networks and output communities. We found that although the rate of speedup slows down as the number of processors is increased, it still pays off to utilize as many cores as the underlying hardware has available. Our multithreaded SLPA implementation was shown to be scalable both in terms of increasing the number of cores and analyzing increasingly large networks. Moreover, the properties of the detected communities closely match those produced by the base sequential algorithm, as verified using several metrics. Given a sufficient number of processors, the parallel SLPA can reliably process networks with millions of nodes and accurately detect meaningful communities within minutes to hours. In our future work, we plan to explore other parallel programming paradigms and compare their performance with our multithreaded approach.

ACKNOWLEDGMENTS
Research was sponsored in part by the Army Research Laboratory under Cooperative Agreement Number W911NF-09-2-0053, by the EU's 7FP Grant Agreement No. 316097 and by the Polish National Science Centre, the decision no. DEC-2013/09/B/ST6/02317. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government.