Local Community Detection Algorithm Based on Minimal Cluster

In order to discover the structure of local communities more effectively, this paper puts forward a new local community detection algorithm based on the minimal cluster. Most local community detection algorithms begin from a single node. The agglomeration ability of a single node is generally lower than that of multiple nodes, so the community expansion in our algorithm starts not from the initial node alone but from a node cluster that contains the initial node and whose nodes are relatively densely connected with each other. The algorithm mainly includes two phases: first it detects the minimal cluster, and then it finds the local community extended from the minimal cluster. Experimental results show that the quality of the local communities detected by our algorithm is higher than that of the other algorithms compared, on both real and simulated networks.


Introduction
Community detection on complex networks has been a hot research field. Recently, a large number of algorithms for studying the global structure of the network have been proposed, such as the modularity optimization algorithms [1,2], the spectral clustering algorithms [3][4][5][6], the hierarchical clustering algorithms [7][8][9][10], and the label propagation algorithms [11][12][13][14]. However, with the continuous expansion of complex networks, it is easy to collect large network datasets with millions of nodes. Storing such a large-scale dataset in computer memory for analysis is a huge challenge, and the computation needed to study the overall structure of networks of this size is prohibitively expensive. Local community detection has therefore become an appealing problem and has drawn more and more attention [15][16][17][18].
The main task of local community detection is to find a community using only the local information of the network. Local community detection also extends well: if a local community detection algorithm is executed iteratively, more local communities can be found and the whole community structure of the network can be obtained. The time complexity and quality of this kind of global community detection depend on the efficiency and accuracy of the underlying local community detection algorithm, so research on local community detection algorithms still has a long way to go.
Several problems need to be solved in the research of local community detection. First, we should determine the initial state and find the initial node for local community detection, so as to determine the needed local information. Second, we need to select an objective function and, through continuous iterative optimization of this function, find a community structure of high quality. Third, we need a suitable node expansion method, so that the algorithm can extract the local community from the initial state step by step. Finally, a suitable termination condition is needed to determine the boundary of the community and terminate the algorithm.
Most local community detection algorithms follow the process described above. By definition, local community detection finds the local community structure starting from one or more nodes, but most existing local community detection algorithms, including Clauset [15], LWP [16], and LS [17], start from only one initial node. They greedily select the optimal nodes from the candidate nodes and add them to the local community. The LMD [18] algorithm extends not from the initial node but from its closest and next closest local degree central nodes. It discovers a local community from each of these nodes, respectively, so each expansion still starts from a single node and several local communities are discovered for the initial node. In general, the aggregation ability of a single node is lower than that of multiple nodes, so we do not rely on the initial node alone as the beginning of local community expansion. Our primary goal is to find a minimal cluster closely connected to the initial node and then detect the local community based on this minimal cluster. This avoids the instability caused by excessive dependence on the initial node.
In this paper, we introduce NewLCD, a local community detection algorithm based on the minimal cluster. In this new algorithm, community expansion no longer begins from the initial node only, but from a cluster of nodes relatively closely connected to the initial node. The algorithm mainly consists of two parts: one is the detection of the minimal cluster, and the other is the detection of the local community based on the minimal cluster. The algorithm can also be applied to global community detection: after finding one local community, we can repeat the process to obtain the global community structure of the whole network.

Related Works of Local Community Detection
(1) Clauset Algorithm. In order to solve the problem of local community detection, Clauset [15] put forward the local community modularity $R$ and gave a fast-converging greedy algorithm to find the local community with the greatest modularity. The definition of local community modularity is as follows:
$$R = \frac{\sum_{ij} B_{ij}\,\delta(i, j)}{\sum_{ij} B_{ij}},$$
where $i$ and $j$ represent two nodes in the graph. If nodes $i$ and $j$ are connected, the value of $B_{ij}$ is 1; otherwise, it is 0; if nodes $i$ and $j$ are both in $D$, the value of $\delta(i, j)$ is 1; otherwise, it is 0. The local community detection process of the Clauset algorithm is similar to that of a web crawler. First, the Clauset algorithm starts from an initial node $v$. Node $v$ is added to the subgraph $D$, and all its neighbor nodes are added to $S$. Then the algorithm iteratively adds to the local community the node in $S$ which brings the maximum increment of $R$, until the local community reaches the preset size. That is to say, the algorithm needs a parameter that fixes the size of the community, and the result is greatly influenced by the initial node.
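The greedy procedure described above can be illustrated with a short Python sketch. It assumes Clauset's boundary-based reading of $R$ (the fraction of edges touching boundary nodes that stay inside the community). The adjacency-dictionary representation, the function names `local_modularity_r` and `clauset_expand`, the tie-breaking by smallest node label, and the value returned when the community has no boundary are choices made here for illustration and are not taken from [15].

```python
def local_modularity_r(adj, community):
    """Fraction of edges touching boundary nodes that stay inside the community."""
    community = set(community)
    # Boundary nodes: members with at least one neighbor outside the community.
    boundary = {v for v in community if any(u not in community for u in adj[v])}
    seen, inside = set(), 0
    for v in boundary:
        for u in adj[v]:
            edge = frozenset((u, v))
            if edge in seen:
                continue  # count each undirected edge once
            seen.add(edge)
            if u in community:
                inside += 1
    # Convention assumed here: R = 0 when the community has no boundary.
    return inside / len(seen) if seen else 0.0


def clauset_expand(adj, seed, max_size):
    """Greedily add the neighbor that yields the largest R until max_size is reached."""
    community = {seed}
    while len(community) < max_size:
        frontier = {u for v in community for u in adj[v]} - community
        if not frontier:
            break
        # sorted() makes ties deterministic (smallest label wins).
        best = max(sorted(frontier),
                   key=lambda u: local_modularity_r(adj, community | {u}))
        community.add(best)
    return community


# Toy graph: two triangles joined by the edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
found = clauset_expand(adj, seed=0, max_size=3)
```

On this toy graph the expansion from node 0 with a preset size of 3 recovers the first triangle, which also shows why the preset size matters: a larger value would force nodes of the second triangle into the result.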
(2) LWP Algorithm. The LWP [16] algorithm is an improved algorithm which, compared with the Clauset algorithm, has a clear termination condition. The algorithm defines another local community modularity $M$, which is expressed as
$$M = \frac{\sum_{i<j} A_{ij}\,\delta(i, j)}{\sum_{i<j} A_{ij}\,\lambda(i, j)},$$
where $i$ and $j$ represent two nodes in the graph. If nodes $i$ and $j$ are connected to each other, the value of $A_{ij}$ is 1; otherwise, it is 0; if nodes $i$ and $j$ are both in $D$, the value of $\delta(i, j)$ is 1; otherwise, it is 0; if only one of the nodes $i$ and $j$ is in $D$, the value of $\lambda(i, j)$ is 1; otherwise, it is 0. In other words, $M$ is the ratio of the number of edges inside $D$ to the number of edges leaving $D$. Given an undirected and unweighted graph $G = (V, E)$, the LWP algorithm starts from an initial node to find a subgraph with the maximum value of $M$. If the subgraph is a community (i.e., $M > 1$), then it returns the subgraph as a community. Otherwise, it is considered that no community can be found starting from this initial node. For an initial node, the LWP algorithm finds a subgraph with the maximum value of the local modularity $M$ in two steps. First, the algorithm is initialized by constructing a subgraph $D$ with only the initial node $v$, and all the neighbor nodes of node $v$ are added to the set $S$. Then the algorithm performs an incremental step and a pruning step.
In the incremental step, the node in $S$ which increases the local modularity of $D$ the most is added to $D$. The greedy algorithm iteratively adds nodes from $S$ to $D$ until no node in $S$ can be added. In the pruning step, if the local modularity of $D$ becomes larger when a node is removed from $D$, the node is removed. Pruning continues until no node can be removed, and the algorithm must ensure that the connectivity of $D$ is not destroyed. Then the set $S$ is updated and the two steps are repeated until there is no change. The algorithm has a high Recall, but its Precision is low.
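The two steps can be sketched in Python as follows. This is a minimal illustration of the incremental and pruning steps as described in the text, not the reference implementation of [16]: the helper names (`modularity_m`, `is_connected`, `lwp`), the adjacency-dictionary input, the handling of a subgraph with no outgoing edges, and the rule that the result is rejected when the initial node has been pruned away are assumptions made here.

```python
def modularity_m(adj, nodes):
    """M = (edges inside the subgraph) / (edges leaving the subgraph)."""
    nodes = set(nodes)
    e_in = sum(1 for v in nodes for u in adj[v] if u in nodes) // 2
    e_out = sum(1 for v in nodes for u in adj[v] if u not in nodes)
    if e_out == 0:
        # Assumed convention: a closed subgraph with edges is infinitely good.
        return float("inf") if e_in else 0.0
    return e_in / e_out


def is_connected(adj, nodes):
    """Depth-first check that the induced subgraph on `nodes` is connected."""
    nodes = set(nodes)
    if not nodes:
        return False
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for u in adj[v]:
            if u in nodes and u not in seen:
                seen.add(u)
                stack.append(u)
    return seen == nodes


def lwp(adj, seed):
    d = {seed}
    changed = True
    while changed:
        changed = False
        # Incremental step: keep adding the neighbor with the largest gain in M.
        while True:
            s = {u for v in d for u in adj[v]} - d
            current = modularity_m(adj, d)
            gains = [(modularity_m(adj, d | {u}) - current, u) for u in sorted(s)]
            gains = [g for g in gains if g[0] > 0]
            if not gains:
                break
            d.add(max(gains, key=lambda g: g[0])[1])
            changed = True
        # Pruning step: drop a node if M grows and the subgraph stays connected.
        for v in sorted(d):
            rest = d - {v}
            if (rest and is_connected(adj, rest)
                    and modularity_m(adj, rest) > modularity_m(adj, d)):
                d = rest
                changed = True
    # Accept the subgraph only if it is a community (M > 1) containing the seed.
    return d if seed in d and modularity_m(adj, d) > 1 else None


# Toy graph: two triangles joined by the edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
community = lwp(adj, 0)
```

Because every accepted addition or removal strictly increases $M$, the loop cannot revisit a previous subgraph and therefore terminates.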
The complexity of these two algorithms is $O(k^2 d)$, where $k$ is the number of nodes to be explored in the local community and $d$ is the average degree of the nodes to be explored in the local community.

Description of the Proposed Algorithm
Discovery of Minimal Cluster.
Generally, a network can be described by a graph $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges. It contains $n$ nodes and $m$ edges. $C$ represents the node set of a local community in the network and $|C|$ is the number of nodes in $C$. We introduce two definitions related to the algorithm proposed in this paper.
Definition 1 (neighbor node set). It is the set of nodes connected directly to a single node or a community.
For node $v$, its neighbor node set can be expressed as
$$N(v) = \{u \mid (u, v) \in E\}.$$
For community $C$ containing $|C|$ nodes, its neighbor node set can be expressed as follows:
$$N(C) = \bigcup_{v \in C} N(v) \setminus C.$$
Definition 2 (number of shared neighbors). The number of shared neighbors for nodes $u$ and $v$ can be calculated as
$$|N(u) \cap N(v)|.$$
The minimal cluster detection is the key of the algorithm. The minimal cluster is the set of nodes that connect to the initial node most closely. We introduce a method proposed in [22] to find the nodes that are closely connected with the initial node. It uses the widely used density function $\Psi$ [23], which is computed from the number of edges inside community $C$ and the number of nodes $|C|$ in community $C$. The larger $\Psi(C)$ is, the more densely the nodes in $C$ are connected. It is necessary to set a threshold for $\Psi(C)$ to decide which nodes are selected to form the initial minimal cluster.
Two thresholds, one on $\Psi(C)$ and one on the number of edges inside $C$, are used to select the nodes that constitute the minimal cluster $C$. If either threshold is reached, these nodes are considered to form a minimal cluster. Compared with other methods, the threshold values are not set manually but depend entirely on the nodes in $C$, so the uncertainty of the algorithm is reduced. Through this process, all nodes in the network can be assigned to several densely connected clusters. In the process, the constraint conditions of the minimal clusters are relatively strict. Then the global community structure of the network is found by combining these minimal clusters. This is a process from local to global, in which all minimal clusters are found to obtain the global structure of the network. Our local community detection algorithm only needs to find one community in the global network. Inspired by this idea, we improve this algorithm as shown in Algorithm 1.

Algorithm 1: Locating minimal cluster.
(4) Let $C = (N(u) \cap N(v)) \cup \{u, v\}$;
(5) end if
(6) end for
(7) return $C$
In the network $G$, we want to find the minimal cluster containing node $v$. First we traverse all the neighbors of node $v$ to find the node $u$ which shares the most neighbors with node $v$ (step 3). Then we take nodes $u$ and $v$ and their shared neighbor nodes as the initial minimal cluster (step 4). Generally speaking, node $v$ and its neighbor nodes are most likely to belong to the same community. We find the node $u$ most closely connected with $v$ according to the number of their shared neighbors: the more neighbors two nodes share, the more closely they are connected. That is to say, the nodes connected with both $u$ and $v$ are more likely to belong to the same community as $u$ and $v$. We put them together as the initial minimal cluster of local community expansion, a choice that our experiments show to be effective and reliable.
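The selection rule described in this paragraph can be sketched in Python. Since only the textual description of steps 3 and 4 is used, this is an illustration rather than a transcription of Algorithm 1; the function name `locate_minimal_cluster`, the adjacency dictionary of neighbor sets, the tie-breaking by smallest node label, and the handling of an isolated node are assumptions made here.

```python
def locate_minimal_cluster(adj, v):
    """Return v, its neighbor u with the most shared neighbors, and N(u) ∩ N(v)."""
    best_u, best_shared = None, -1
    for u in sorted(adj[v]):              # traverse all neighbors of v
        shared = len(adj[u] & adj[v])     # number of shared neighbors of u and v
        if shared > best_shared:          # strict ">" keeps the smallest label on ties
            best_u, best_shared = u, shared
    if best_u is None:                    # assumed: an isolated node is its own cluster
        return {v}
    return (adj[best_u] & adj[v]) | {best_u, v}


# Toy graph: two triangles joined by the edge 2-3, plus an isolated node 9.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}, 9: set()}
cluster = locate_minimal_cluster(adj, 0)
```

The cost is one set intersection per neighbor of $v$, so the number of intersections grows linearly with the degree of the initial node.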

Detection of Local Community.
First of all, we use Algorithm 1 to find the node $u$ which is most closely connected to the initial node $v$. We take node $u$ and node $v$ as well as their shared neighbor nodes as the initial minimal cluster. The second part of the algorithm expands nodes from the minimal cluster and finally finds the local community. The specific process is shown in Algorithm 2.
In the algorithm, we still use the $M$ function of the LWP algorithm as the criterion of local community expansion. Algorithm 1 finds the initial minimal cluster, which serves as the initial local community LC. After that, Algorithm 2 finds the neighbor node set N(LC) of LC and calculates the initial value of $M$ (step 02). Then it traverses all the nodes in N(LC) (steps 03-04) to find the node which maximizes $\Delta M$ and adds it into the local community LC (steps 05-08); it updates N(LC) and $M$ (step 09) and repeats until no new node is added to LC (step 10).
The complexity of the NewLCD algorithm is almost the same as that of the Clauset algorithm. The NewLCD algorithm spends extra time on finding the minimal cluster, which is linear in the degree of the initial node $v$.
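The two phases can be combined in a compact Python sketch. It follows the prose description of Algorithm 2 (start from the minimal cluster, repeatedly add the neighbor with the largest positive $\Delta M$, stop when no neighbor improves $M$). The function names, the adjacency-dictionary input, the tie-breaking by smallest node label, and the treatment of a subgraph with no outgoing edges are assumptions made for this illustration.

```python
def modularity_m(adj, nodes):
    """M = (edges inside the node set) / (edges leaving the node set)."""
    nodes = set(nodes)
    e_in = sum(1 for v in nodes for u in adj[v] if u in nodes) // 2
    e_out = sum(1 for v in nodes for u in adj[v] if u not in nodes)
    if e_out == 0:
        return float("inf") if e_in else 0.0   # assumed convention
    return e_in / e_out


def minimal_cluster(adj, v):
    """Phase 1: v, its neighbor u with the most shared neighbors, and N(u) ∩ N(v)."""
    if not adj[v]:
        return {v}
    u = max(sorted(adj[v]), key=lambda w: len(adj[w] & adj[v]))
    return (adj[u] & adj[v]) | {u, v}


def new_lcd(adj, v):
    """Phase 2: expand the minimal cluster while some neighbor increases M."""
    lc = minimal_cluster(adj, v)
    m = modularity_m(adj, lc)
    while True:
        neighbors = {u for w in lc for u in adj[w]} - lc    # N(LC)
        best, best_gain = None, 0.0
        for u in sorted(neighbors):
            gain = modularity_m(adj, lc | {u}) - m          # delta M for u
            if gain > best_gain:
                best, best_gain = u, gain
        if best is None:                                    # no node improves M
            return lc
        lc.add(best)
        m = modularity_m(adj, lc)


# Toy graph: two 4-cliques {0,1,2,3} and {4,5,6,7} joined by the edge 3-4,
# plus node 8 attached to nodes 0 and 1.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3),
         (4, 5), (4, 6), (4, 7), (5, 6), (5, 7), (6, 7),
         (3, 4), (8, 0), (8, 1)]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)
community = new_lcd(adj, 8)
```

Starting from the peripheral node 8, the minimal cluster is {0, 1, 8}; the expansion then adds nodes 2 and 3 and stops at the bridge to the second clique, because adding node 4 would lower $M$.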

Experimental Results and Analysis
In this section, the NewLCD algorithm is compared with several representative local community detection algorithms, namely, LWP, LS, and Clauset, to verify its performance. The experimental environment is the following: Intel(R) Core(TM) i5-2400 CPU @ 3.10 GHz; 2 GB memory; operating system: Windows 7; programming language: C#.Net.

Experimental Data.
The dataset of LFR benchmark networks and three real network datasets are used in the experiments.
(1) LFR benchmark networks [24] are currently the most commonly used synthetic networks in community detection. They are generated with the following parameters: $N$ is the number of nodes; $minc$ is the number of nodes in the smallest community; $maxc$ is the number of nodes in the largest community; $k$ is the average degree of nodes in the network; $maxk$ is the maximum node degree; $mu$ is the mixing parameter, that is, the probability that a node is connected to nodes outside its own community. The greater $mu$ is, the more difficult it is to detect the community structure. We generate four groups of LFR benchmark networks. Two groups of networks, B1 and B2, share the common parameters $N = 1000$, $k = 20$, and $maxk = 50$. The other two groups of networks, B3 and B4, share the common parameters $N = 5000$, $k = 20$, and $maxk = 50$. The community size $\{minc, maxc\}$ of B1 and B3 is $\{10, 50\}$ and the community size $\{minc, maxc\}$ of B2 and B4 is $\{20, 100\}$, giving networks with small communities and networks with large communities, respectively. Each group contains nine networks with $mu$ ranging from 0.1 to 0.9, that is, from weakly to strongly mixed (hybrid) networks. The details are shown in Table 1.
(2) We choose three real networks including Zachary's Karate club network (Karate), American college Football network (Football), and American political books network (Polbooks).The detailed information is shown in Table 2.

Experiments on Artificial Networks.
Because of the large size of the synthetic networks, 50 representative nodes are randomly selected from each group as initial nodes, and the experimental results are averaged to give the final result. Figures 3-6 compare the experimental results of each algorithm on the four groups of LFR benchmark networks (B1-B4). The ordinate represents the three evaluation criteria for local community detection (Precision, Recall, and F-score), and the abscissa is the value of mu (0.1-0.9). The following conclusions can be obtained by observation.
(1) The LS and LWP algorithms have higher Precision than the Clauset algorithm, but their Recall is lower. The LS and LWP algorithms cannot achieve both high Precision and high Recall, so their comprehensive effect may be no better than that of the benchmark algorithm Clauset.
(2) All three indicators of the NewLCD algorithm are significantly higher than those of the Clauset algorithm, which shows that the initial state indeed affects the results of a local community detection algorithm and that starting from the minimal cluster is better than starting from a single node.
(3) Overall, the NewLCD algorithm is the best. On the four groups of networks, when the parameter mu is less than 0.5, the NewLCD algorithm can find almost all the local communities in which the initial nodes are located. In high hybrid networks, when the value of mu is greater than 0.8, the NewLCD algorithm performs poorly, just like the other algorithms. The main reason is that the community structure of such networks is not obvious.
In summary, the NewLCD algorithm detects better local communities on the artificial networks than the other three local community detection algorithms.
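The three evaluation criteria are not defined in this section. Assuming the usual set-based definitions (Precision is the fraction of detected nodes that belong to the true community, Recall is the fraction of the true community that is detected, and F-score is their harmonic mean), they can be computed as in the following Python sketch. The function name `evaluate` and the zero values returned for empty inputs are choices made here for illustration.

```python
def evaluate(detected, truth):
    """Return (precision, recall, f_score) for a detected vs. a true community."""
    detected, truth = set(detected), set(truth)
    overlap = len(detected & truth)
    precision = overlap / len(detected) if detected else 0.0
    recall = overlap / len(truth) if truth else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Hypothetical example: 4 detected nodes, 2 of which lie in a true community of 6.
precision, recall, f_score = evaluate({1, 2, 3, 4}, {3, 4, 5, 6, 7, 8})
```

Under these definitions a small but pure result scores high Precision and low Recall, which is the pattern reported for LS and LWP, while the F-score rewards algorithms that balance the two.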

Experiments on Real Networks.
In order to further verify the effectiveness of the NewLCD algorithm, we compare it with the three other algorithms on three real networks (Karate, Football, and Polbooks). These three networks are often used to verify the effectiveness of algorithms on complex networks. The experimental results are shown in Table 3, and the maximum value of each indicator is presented in boldface. The maximum value of Precision on Karate is 0.989, obtained by the LS algorithm, but its Recall is just 0.329, which is the minimum among the four algorithms, so the overall result of the LS algorithm is the worst. On the Karate network, the Clauset and LWP algorithms have the same problem as LS: their Recall is low. The Recall and F-score values of the NewLCD algorithm are the largest, so the NewLCD algorithm is optimal. On the Football network, the comprehensive effect of the NewLCD algorithm is also the best. On the Polbooks network, the advantages of the NewLCD algorithm are more obvious, and all three indicators of its results are the best. In summary, the NewLCD algorithm is effective not only on the artificial networks but also on the real networks.
The Karate network is a classic interpersonal relationship network in sociology. It reflects the relationships between managers and trainees in a Karate club at an American university. The club's administrator and instructor had different opinions on whether to raise the club fee. As a result, the club split into two independent small clubs. Since the structure of the Karate network is simple and it reflects the real world, many community detection algorithms use it as a standard experimental dataset to verify the quality of the detected communities. In order to further verify the effectiveness of the algorithm, we do a further experiment on Karate. Figure 7

Conclusion
This paper proposes NewLCD, a new local community detection algorithm based on the minimal cluster. The algorithm mainly consists of two parts. The first part finds the initial minimal cluster for local community expansion. The second part adds nodes from the neighbor node set which meet the local community condition into the local community. We compare the proposed algorithm with three other local community detection algorithms on real and artificial networks. The experimental results show that the proposed algorithm finds the local community structure more effectively than the other algorithms.

Figure 1: Definition of local community.

Figure 2: The discovery of minimal cluster.

Table 2: Real network information.

Table 3: The comparison of algorithms on the real networks.