Detecting Local Community Structures in Networks Based on Boundary Identification

Detecting communities within networks is of great importance for understanding the structure and organization of real-world systems. One of the major challenges is to find the local community of a given node with limited knowledge of the global network. Most existing methods depend heavily on the starting node and require predefined parameters to control the agglomeration procedure, which may interfere with the results of local community detection. In this work, we propose a parameter-free local community detection algorithm that uses two self-adaptive phases and comprehensively considers the external and internal link similarity of neighborhood nodes in each clustering iteration. Based on boundary node identification, our self-adaptive method can effectively control the scale and scope of the local community. Experimental results show that our algorithm is efficient and performs well on both computer-generated and real-world networks, greatly improving the stability and accuracy of local community detection.


Introduction
Nowadays, most real-world networks contain groups of nodes that are densely connected to each other; such groups are usually called communities. Efficiently identifying these communities helps us understand the nature of the networks and facilitates their analysis. In many applications [1][2][3], the discovery of communities is an important issue in the analysis of complex networks such as email networks, the World Wide Web, and citation networks. To better understand the topological properties and organizational structures of complex systems [4,5], many community detection algorithms have recently been developed using techniques such as information theory [6], hierarchical clustering [7], and modularity optimization [8].
However, when the entire structure of a network is not available because of its large scale and dynamics, these global algorithms may not be applicable for detecting modules, especially in incomplete networks. It is therefore desirable to use the limited available information about the network graph to discover the local community of a given source node. Recently, a number of community identification algorithms [9][10][11][12][13][14] have been proposed and widely applied to discover local structure under partial knowledge of the network. However, most of these methods require preset parameters, such as a local similarity threshold for node clustering, and provide no automated way to determine them. The stopping criteria of local community detection depend heavily on these predefined parameters, which control how strong the community should be. More seriously, the performance of these algorithms is strongly limited by the position of the source node, especially when the source node belongs to an obscure community, which may introduce a large number of outliers.
In this paper, we propose a new parameter-free method to discover the local community using incomplete information about the known structure of the network. In our method, the local community is regarded as a supernode. For every candidate node that has links with the local community, we calculate its link similarity with each of its neighboring nodes, instead of considering only its link relationship with the local community as most existing algorithms do, and then cluster the nodes whose maximum link similarity is with the local community. Through the two self-adaptive phases of community detection, the method iteratively identifies the surrounding boundary of the current local community and updates the neighborhood of the local community for the next round of node clustering. Consequently, this mechanism ensures that our method can add suitable nodes to the local community without any predefined parameters and finally obtain an accurate and meaningful community from any given node. The main contributions are summarized as follows.
(1) In the agglomeration of the local community, for each neighborhood node, we comprehensively consider its link similarity with the local community and with its other adjacent nodes, rather than depending only on its link relationship with the local community as most existing methods do. This mechanism ensures that each expansion of the local community is reasonable and achieves high accuracy.
(2) By combining the advantages of boundary detection and link-similarity-based modularity optimization, we can automatically obtain the precise coverage of the local community in the network and optimize the quality of the discovered community. Our algorithm therefore does not need preset thresholds to stop the clustering, and it overcomes the resolution limit that other modularity-based algorithms suffer from.
(3) Our algorithm has good robustness and stability. The community results do not depend on the position and topological features of the given node. Experimental results show that our algorithm is effective and efficient on both computer-generated and real-world networks, regardless of what kind of source node is given.
This paper is organized as follows. We review the existing local community detection methods and analyze their problems in Section 2. In Section 3, we formalize the notions used in our method. The algorithm is then described in detail in Section 4. Section 5 presents our experimental results, and Section 6 concludes the paper.

Related Work
Here we review and compare several typical approaches to studying local community structure in networks. Starting from a given node, the algorithms in [9][10][11] each propose a different metric for clustering adjacent nodes. These metrics determine which nodes can be clustered into the community and thus form the final local community structure. Clauset [9] defined the local community modularity $R$:
$$R = \frac{B_{in}}{B_{in} + B_{out}},$$
where $B_{in}$ is the number of edges that connect community boundary nodes and nodes in the community, while $B_{out}$ is the number of edges that connect boundary nodes and nodes outside the community. The algorithm iteratively clusters the neighboring node yielding the largest increase of $R$, until the community has reached a predefined size.
Luo et al. proposed a new modularity $M$ for local community evaluation [10]:
$$M = \frac{E_{in}}{E_{out}},$$
that is, the ratio of the number of internal edges $E_{in}$ to the number of external edges $E_{out}$. A node is added to or removed from the community $C$ if and only if doing so increases $M$. This algorithm turns out to yield high recall but low accuracy.
In [11], a fitness function is defined:
$$f_{\mathcal{G}} = \frac{k_{in}^{\mathcal{G}}}{\left(k_{in}^{\mathcal{G}} + k_{ex}^{\mathcal{G}}\right)^{\alpha}},$$
where $k_{in}^{\mathcal{G}}$ and $k_{ex}^{\mathcal{G}}$ are the total internal and external degrees of the nodes of module $\mathcal{G}$ and $\alpha$ is the resolution parameter, which is used for controlling the size of the communities.
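The three metrics reviewed above can be compared on a small example. The following is a minimal sketch, assuming an undirected, unweighted graph stored as a dictionary of neighbor sets; the function name `local_metrics`, the input format, and the handling of empty denominators are illustrative choices made here and are not taken from [9][10][11].

```python
def local_metrics(adj, community, alpha=1.0):
    """Return (R, M, fitness) for `community` in an undirected graph.

    adj: dict mapping each node to the set of its neighbours
         (hypothetical input format, no self-loops).
    """
    C = set(community)
    # Boundary nodes: members of C with at least one neighbour outside C.
    boundary = {u for u in C if adj[u] - C}

    # Clauset's R [9]: edges touching a boundary node, counted once each.
    b_in = len({frozenset((u, v)) for u in boundary for v in adj[u] & C})
    b_out = sum(len(adj[u] - C) for u in boundary)
    # Convention chosen here: a community with no boundary gets R = 1.
    R = b_in / (b_in + b_out) if (b_in + b_out) else 1.0

    # Luo et al.'s M [10]: internal edges over external edges.
    e_in = sum(len(adj[u] & C) for u in C) // 2
    e_out = sum(len(adj[u] - C) for u in C)
    M = e_in / e_out if e_out else float("inf")

    # Fitness of [11]: the internal degree counts every internal edge twice.
    k_in, k_ex = 2 * e_in, e_out
    fitness = k_in / (k_in + k_ex) ** alpha if (k_in + k_ex) else 0.0
    return R, M, fitness


# Toy graph: triangle a-b-c with a tail c-d-e.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
R, M, fitness = local_metrics(adj, {"a", "b", "c"})
```

In this toy graph only node c lies on the boundary of the triangle, so the three metrics all reward the same community but on different scales, which is one reason each method needs its own stopping rule.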
Recently, several novel metrics for local community structure have also been presented in [12][13][14][15]. By combining the advantages of these quality metrics, existing algorithms have made great improvements in finding more precise and complete communities. Furthermore, other methods have been proposed using techniques such as maximum clique extension [16] and clustering with selection probability [17], or based on the density drop of subgraphs [18]. However, none of these methods has really solved the problem of when to stop the clustering iteration, so they have to preset manual parameters to control the expansion of the local community. These parameters are hard to determine accurately and may greatly affect the final results of community detection. More seriously, the performance of these approaches is sensitive to the given node and may be strongly constrained by its location or topology. When starting from different nodes, even nodes that actually belong to the same community, the detected local communities may be very different. To deal with these problems, we introduce link similarity to measure the connection tightness between an undetermined neighborhood node and the local community, reveal the natural community structure based on boundary identification, and finally control the community size and scale via local optimization of the similarity-modularity measure.

Preliminary
Consider a network $G(V, E)$ composed of the node set $V$ and the edge set $E$, where $w(u, v)$ represents the weight of the edge connecting nodes $u$ and $v$. Given a node $v$, our task is to discover the entire local community that $v$ belongs to. In this section, we formalize some notions of the local community and introduce a structural similarity-based metric.
Definition 1 (neighborhood). The structure neighborhood [12,19] of node $u$ is the set $N(u) = \{v \in V \mid uv \in E\} \cup \{u\}$; that is, $N(u)$ comprises $u$ and the adjacent nodes that are directly connected to $u$. For the local community $C$ (shown in Figure 1), $N(C) = \{v \in V - C \mid \exists u \in C, \text{ s.t. } uv \in E\}$ denotes the neighborhood of community $C$, $B(C) \subseteq N(C)$ is its boundary set, and $U = V - C - N(C)$ is the outside node set in the network. In order to quantify the connecting relationship between nodes, we adopt a structural similarity measure that is extended from the cosine similarity [20]. This metric effectively denotes the link similarity of any two adjacent nodes in the network and reflects the local structural similarity between them.
Definition 2 (link similarity). Given a weighted network $G(V, E)$, one can use the neighborhood structures of node $u$ and node $v$ to evaluate their similarity in link density and link formation [14]. The link similarity between two nodes $u$ and $v$ is defined as
$$S(u, v) = \frac{\sum_{x \in N(u) \cap N(v)} w(u, x)\, w(v, x)}{\sqrt{\sum_{x \in N(u)} w^{2}(u, x)}\, \sqrt{\sum_{x \in N(v)} w^{2}(v, x)}}.$$
When considering an unweighted graph, the weight $w(u, v)$ of an edge $uv \in E$ can be set to 1. Adjacent nodes that share more common links have larger local similarity and connection tightness and are more likely to be clustered into the local community.
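To make the link similarity concrete, the following is a minimal sketch of a cosine-style structural similarity over closed neighborhoods. It assumes a symmetric weight dictionary and assigns weight 1 to each node's link with itself; both the data layout and the self-weight are assumptions made for illustration, and the function name `link_similarity` is hypothetical.

```python
import math


def link_similarity(w, u, v):
    """Cosine-style similarity of the closed neighbourhoods of u and v.

    w: dict mapping node -> {neighbour: weight}, assumed symmetric.
    The self-weight w(x, x) = 1 is an assumption of this sketch.
    """
    def closed_weights(x):
        weights = dict(w.get(x, {}))
        weights.setdefault(x, 1.0)  # the neighbourhood includes x itself
        return weights

    wu, wv = closed_weights(u), closed_weights(v)
    common = wu.keys() & wv.keys()
    numerator = sum(wu[x] * wv[x] for x in common)
    denominator = (math.sqrt(sum(a * a for a in wu.values()))
                   * math.sqrt(sum(a * a for a in wv.values())))
    return numerator / denominator


# Unweighted toy graph: triangle a-b-c plus a pendant edge c-d.
w = {
    "a": {"b": 1.0, "c": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"a": 1.0, "b": 1.0, "d": 1.0},
    "d": {"c": 1.0},
}
s_ab = link_similarity(w, "a", "b")  # identical closed neighbourhoods
s_cd = link_similarity(w, "c", "d")  # only c and d themselves are shared
```

Nodes a and b share their whole closed neighborhood and therefore score higher than the pair c and d, which matches the intuition that nodes sharing more common links are more tightly connected.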
Generally, a network community is regarded as a group of nodes that are more densely connected to each other than to the rest of the network. The topology and organization of a community indicate the structural and functional characteristics that distinguish it from other communities in the network. Based on the above analysis, link similarity can be used as a criterion to determine whether a candidate node should be added to the community. Moreover, our experimental results also confirm the effectiveness of this metric.
The definition of the boundary means that the nodes inside $B(C)$ have less structural similarity with $C$ than with nodes or subgraphs outside $C$. As shown in Figure 1, a community is usually characterized by more interactions among its members than with the remainder of the network; the boundary of the community is naturally formed by the border nodes, which are far from the core of the community in which they are located. By detecting the boundary nodes, we can determine the borders of the local community, keep the local community detection under effective control, and avoid introducing outliers while the community expands by clustering its adjacent nodes.

The Two-Phase Local Community Detection Based on Boundary Identification
In this section, we describe our local community detection algorithm, which consists of two phases: the agglomeration phase and the boundary identification phase. In addition to performing an accurate agglomeration, a local method must determine when to stop clustering nodes [19]. Our algorithm therefore integrates the two phases to enhance the neighborhood node clustering in each iteration. Moreover, a link-similarity-based modularity is adopted to evaluate the quality of the local community and help identify the border nodes.
In the agglomeration phase, we consider the local community $C$ as a supernode. For each neighborhood node of $C$, we calculate its link similarity with each of its adjacent nodes. For a neighborhood node $u$, we find the node that has the maximum link similarity with $u$. If that node is the supernode representing the local community $C$, we add $u$ to $C$; if not, we put $u$ into the boundary set. Through this agglomeration process, we build the current local community and construct its boundary. An example of the procedure is illustrated in Figure 2. In the boundary identification phase, we confirm the boundary nodes of the local community to control its exact size and scale. For a node $u \in B(C)$ that has the maximum link similarity with a node $v$ outside $C$, we merge the two nodes into a module $M$ and calculate the modularity gain $\Delta Q_s(C)$ that results from adding the module to the local community, in order to judge whether to add that module to $C$. After all the border nodes are fixed and no neighborhood nodes can be agglomerated, we obtain the final local community. The procedure is given in Algorithm 1.
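One round of the agglomeration phase can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the paper's Algorithm 1: the community is contracted into a supernode whose edge weight to an outside node is the number of edges between them, every node has self-weight 1, and ties are broken arbitrarily. The names `SUPER`, `contract`, `similarity`, and `agglomeration_round` are hypothetical.

```python
import math

SUPER = "__C__"  # hypothetical label for the contracted local community


def contract(adj, C):
    """Merge the nodes of C into one supernode; parallel edges add up."""
    w = {}
    for u, nbrs in adj.items():
        a = SUPER if u in C else u
        for v in nbrs:
            b = SUPER if v in C else v
            if a == b:
                continue  # edges inside C disappear
            row = w.setdefault(a, {})
            row[b] = row.get(b, 0.0) + 1.0
    return w


def similarity(w, u, v):
    """Cosine-style link similarity with self-weight 1 (an assumption)."""
    wu = {u: 1.0, **w.get(u, {})}
    wv = {v: 1.0, **w.get(v, {})}
    num = sum(wu[x] * wv[x] for x in wu.keys() & wv.keys())
    den = (math.sqrt(sum(a * a for a in wu.values()))
           * math.sqrt(sum(a * a for a in wv.values())))
    return num / den


def agglomeration_round(adj, C):
    """Split the neighbourhood of C into absorbed nodes and boundary nodes."""
    w = contract(adj, C)
    absorbed, boundary = set(), set()
    for u in w.get(SUPER, {}):  # every node adjacent to the community
        best = max(w[u], key=lambda x: similarity(w, u, x))
        (absorbed if best == SUPER else boundary).add(u)
    return absorbed, boundary


# Two triangles a-b-c and d-e-f joined by the bridge c-d.
adj = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
first = agglomeration_round(adj, {"a", "b"})
second = agglomeration_round(adj, {"a", "b", "c"})
```

Starting from the community {a, b}, node c is most similar to the supernode and is absorbed; in the next round node d is more similar to e and f than to the community, so it is placed in the boundary set and the expansion stops at the bridge.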
It is important to note that, in the boundary identification phase, our algorithm needs a metric to measure the goodness of the discovered local community. Many quality functions have been proposed, such as fitness [11] and modularity [21]. Here we use the link-similarity-based modularity function $Q_s$ proposed in [22]. It is extended from the normal modularity $Q$ [21] and has a better ability to deal with hubs and outliers. Given two adjacent modules $C$ and $M$, the modularity gain $\Delta Q_s(C)$ can be defined in terms of the following quantities: $IS(C, M) = \sum_{u \in C, v \in M} S(u, v)$ is the total similarity of the links between the two modules $C$ and $M$, $DS(C) = \sum_{u, v \in C} S(u, v)$ is the total similarity of nodes within community $C$, and $TS$ represents the total similarity between any two nodes in the network $G$.
The link-similarity-based modularity described above is an effective metric for evaluating the quality of a network partition, and we use the modularity gain $\Delta Q_s(C)$ to determine whether a boundary node belongs to the local community. If $\Delta Q_s(C) \le 0$, the node $u$ is a border node of $C$; otherwise, the module $M$ is merged into the local community, and $N(C)$ and $B(C)$ are updated for the next round of agglomeration. The iteration continues until no node is left that satisfies the above conditions. It is worth noting that the main task of our method is to find a higher-modularity solution under the principles of boundary identification and optimal similarity-modularity, rather than to search greedily for a partition that maximizes the modularity $Q_s$.
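The decision rule of the boundary identification phase can be illustrated with a short sketch. The closed form used below, $\Delta Q_s = 2\,IS/TS - 2\,DS(C)\,DS(M)/TS^2$, is an assumption of this sketch: it mirrors the standard merge gain of edge-based modularity with edge counts replaced by similarities and should be checked against [22] before use. The functions `inter_similarity`, `modularity_gain`, and `is_border` and the pair-keyed similarity dictionary are hypothetical.

```python
def inter_similarity(sim, C, M):
    """IS(C, M): total link similarity between modules C and M.

    sim: dict mapping frozenset({u, v}) -> similarity of adjacent u, v.
    """
    return sum(sim.get(frozenset((u, v)), 0.0) for u in C for v in M)


def modularity_gain(is_cm, ds_c, ds_m, ts):
    """Assumed merge gain: 2*IS/TS - 2*DS(C)*DS(M)/TS**2 (see lead-in)."""
    return 2.0 * is_cm / ts - 2.0 * ds_c * ds_m / ts ** 2


def is_border(is_cm, ds_c, ds_m, ts):
    """A non-positive gain marks the candidate as a border node of C."""
    return modularity_gain(is_cm, ds_c, ds_m, ts) <= 0.0


# Illustrative numbers only; they do not come from the paper's experiments.
sim = {frozenset(("c", "d")): 0.5, frozenset(("c", "e")): 0.25}
is_cm = inter_similarity(sim, {"a", "c"}, {"d", "e"})
gain = modularity_gain(is_cm, ds_c=2.0, ds_m=1.0, ts=5.0)
weak = is_border(0.1, ds_c=5.0, ds_m=5.0, ts=10.0)
```

With these made-up values the first module has a positive gain and would be merged, while the second candidate, which is only weakly tied to the community, is classified as a border node.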
The boundary nodes around $C$ lie at the junction of several adjacent communities, and their special locations make it more difficult to judge whether they belong to the local community. Therefore, during the expansion of the local community, once the link conditions of $C$ change, we update its neighborhood and boundary and reevaluate the link similarity between them for further agglomeration and boundary identification. By performing this two-phase community detection, the heuristic mechanism helps ensure that our result is a strong and stable community.
The running time of our algorithm is mainly consumed in calculating the link similarity and finding the neighborhood node with the largest similarity while the local community is being formed. The calculation of $\Delta Q_s(C)$ is somewhat time consuming, with complexity $O(|C|)$. The computational complexity of the local community detection method is therefore $O(k \log |C|)$, where $k$ is the average degree of the nodes involved and $|C|$ is the number of nodes in the local community.

Experiments and Results
To evaluate the performance of the local community detection methods, we measure our results by precision, recall, and F-measure, which are widely adopted by other community detection methods.

Algorithm 1: Local community detection algorithm based on boundary identification.
Input: network $G(V, E)$ and a given node $v$.
Output: the local community $C$ that $v$ belongs to.
(1) Synthetic benchmark networks: using the Lancichinetti-Fortunato-Radicchi (LFR) benchmark [23], we generate 9 networks with the mixing parameter $\mu$ ranging from 0.1 to 0.9 in steps of 0.1 and randomly choose 100 nodes from each network to evaluate the performance of our algorithm. The important properties of these networks are as follows: the node number is $n = 10000$, the average edge number is $m = 151734$, and the average node degree is $k = 15$. The community structures of these LFR benchmark networks are all known.
The mixing parameter $\mu$ denotes the average ratio between the external degree and the total degree of each node in the network. Generally, the lower the mixing parameter of a network is, the stronger its community structure. LFR networks with various mixing parameters are therefore necessary and valuable for evaluating the methods. Figure 3 shows the comparison of the three metrics for the different algorithms; a well-performing algorithm should achieve high precision, recall, and F-measure at the same time. From the comprehensive comparison of all the metrics, it can be concluded that our algorithm achieves the best performance.
In particular, when the mixing parameter $\mu$ is less than 0.5, the average F-measure of our algorithm is almost 0.90. This means that our approach can recover nearly the full local community structure from the given node. When the community structure of the LFR networks becomes weaker ($\mu \ge 0.5$), all the algorithms suffer varying degrees of performance degradation and become ineffective at detecting community structure. In general, as the community structure of the LFR networks becomes weaker, our algorithm shows much better robustness than the other three algorithms.
(2) Real-world networks: real networks are always more irregular and varied than synthetic networks and have more complex community structures. Here we choose five typical real-world networks: Football [24], Polblogs [25], US airport [26], US power grid [27], and Internet [28]. The basic properties of these networks (nodes, edges, etc.) are presented in Table 1.
The Louvain method (LM) [8] is one of the most popular global community detection algorithms in terms of accuracy and computational cost, and it provides excellent performance even when the networks to be processed are very large. For the real networks with unknown community structure, we use LM to detect the global community structure and take the community that the starting node belongs to as the ground-truth local community. For each network, we randomly select 100 nodes, find the local community of each with the four algorithms, and compare the results with the ground-truth community to test the stability and accuracy of our algorithm.
We next focus on the results obtained on the real-world network datasets. Taking the Football and Polblogs networks as examples, we compare the results of the different algorithms on the three evaluation metrics. Figure 4 shows that the performance of our algorithm is good and balanced across precision, recall, and F-measure. Meanwhile, the existing methods rarely achieve optimal performance on all three evaluation metrics together. As shown in Figure 4, LWP and Clauset achieve a higher recall but a much lower precision, which eventually leads to an unsatisfactory F-measure. For the Chen method, the three metrics are also less than satisfactory, since these methods are sensitive to the position of the source node. When starting from a boundary node $u$ of the local community, the expansion of community $C$ might fall into a different neighboring community that has some nodes connected to $u$.
As the algorithm proceeds, the discovered local community drifts far from the real local community structure, which decreases the recall and precision. When detecting the local community structure in the 5 real-world networks, the boundary identification phase of our algorithm avoids introducing outliers into the local community. The two-phase process helps discover a more stable and complete community structure as early as possible. Since the F-measure combines precision and recall, Table 2 reports the F-measure of all compared methods on those networks. Compared with the other three algorithms, the average F-measure of our method is improved to different degrees. For the Football and US power grid networks, the improvement in F-measure is relatively slight. For the Polblogs, US airport, and Internet networks, the average F-measure of our algorithm shows a much larger increase. The local community detection method based on boundary identification thus demonstrates much better performance. Even in complex large networks, it maintains excellent accuracy and stability. Therefore, our method is more suitable for mining meaningful local community structure in networks.
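The evaluation metrics used throughout this section can be computed from a detected community and its ground-truth counterpart as sketched below. The sketch assumes the usual balanced F-measure, that is, the harmonic mean of precision and recall; the function name `evaluate` and the node labels are illustrative.

```python
def evaluate(detected, truth):
    """Return (precision, recall, F-measure) of a detected local community.

    detected: nodes returned by a local community detection method.
    truth: nodes of the ground-truth community of the same start node.
    """
    detected, truth = set(detected), set(truth)
    correct = len(detected & truth)
    precision = correct / len(detected) if detected else 0.0
    recall = correct / len(truth) if truth else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Illustrative node sets only; they are not taken from the experiments.
precision, recall, f_measure = evaluate({1, 2, 3, 4}, {1, 2, 3, 5, 6})
```

A method that over-expands the community gains recall at the cost of precision, which the F-measure penalizes; this is the pattern described above for methods that cross the community boundary.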

Conclusions
In this paper, we proposed a novel local community detection method based on boundary identification. Compared with existing algorithms, our algorithm is simpler and more effective at discovering the full structure of the local community: it locates the boundary of the local community through two-phase iterations, precisely controls the expansion of the local community, and requires no extra parameters. Experimental results on the synthetic and real-world datasets show that our algorithm is an effective method for exploring local community structures. Moreover, the method works stably and effectively in most networks and can be used to obtain the community of any given node.

Figure 1: An illustration of an abstract network containing the local community $C$, its neighborhood $N(C)$, and the outside node set $U$.

Figure 3: Comparison of (a) precision, (b) recall, and (c) F-measure for different methods on the LFR networks.

Figure 4: Comparison of precision, recall, and F-measure for different methods on the Football network.

Table 1: The features of the real-world networks for performance evaluation.

Table 2: The average evaluation values of F-measure in real-world networks.