An Improved Local Community Detection Algorithm Using Selection Probability

To find the structure of local communities more effectively, we propose an improved local community detection algorithm, ILCDSP, which improves the node selection strategy by setting a selection probability for every candidate node. ILCDSP assigns each candidate node a selection probability that reflects the degree to which the node should be chosen. With this strategy, the proposed algorithm detects local communities effectively, since it steers the search in a better direction and helps avoid locally optimal solutions. Experimental results on both synthetic and real networks show that the quality of the local communities detected by our algorithm is higher than that of the compared state-of-the-art methods.


Introduction
In the real world, many complex systems can be described by various kinds of networks, such as interpersonal networks, biological networks, neural networks, social networks, and the WWW. Commonly, in these networks, individuals are represented by nodes that are linked by particular relationships. A large number of studies reveal that underlying communities exist in most complex networks. Community detection, as a key technology for network analysis, can discover the hidden structures and functions of complex networks, and it is attracting considerable attention from researchers in various domains.
In recent years, a large number of community detection algorithms have been proposed, such as modularity optimization algorithms [1][2][3], spectral clustering algorithms [4,5], hierarchical clustering algorithms [6,7], and label propagation algorithms [8,9]. Most of them focus on finding the global community structure, which contains all the possible communities. However, real networks such as the WWW are large-scale and even dynamic. The information of the whole network is very difficult or even impossible to obtain, so none of these algorithms is suitable for detecting the community structures of large-scale networks with incomplete information. Because of these restrictions, the problem of local community detection has drawn increasing attention. Some local community detection algorithms based on limited information have been proposed, such as Clauset [10], LWP [11], and LS [12]. These algorithms start from a source node and gradually join nodes to the local community; eventually they find the local community containing the source node.
Usually the nodes in a community can be divided into core nodes and edge nodes. A core node connects closely with the interior nodes of the community in which it is located. An edge node connects sparsely with the interior nodes of the community in which it is located. The effect of an algorithm depends on the choice of the source node. If a core node is the source node, the algorithm can usually obtain a better local community structure. Most local community detection algorithms adopt a hill-climbing optimization strategy, which can make the algorithm fall into a local optimum, since they join the node with the optimal value of the objective function at each step. As shown in Figure 1, the source node is an edge node of one community. Because this node is more closely connected with a neighboring community, the next step may join a node of the neighboring community to the local community. The following steps may then continue to gather nodes of the neighboring community. Finally, a local community composed of the neighboring community and the source node could be obtained, so the community in which the source node is really located could not be obtained.
To solve the above problem, we propose an improved local community detection algorithm using selection probability (ILCDSP). The main idea of the algorithm is to set a selection probability for the candidate nodes at each step, so that nodes with a high selection probability are more likely to be chosen. ILCDSP selects nodes randomly rather than always taking the node with the highest modularity gain, which steers the search in a better direction and keeps a single wrong step from derailing all the subsequent steps. The performance of the algorithm is verified on both real and simulated data sets. The evaluation measures used are Precision, Recall, and F-Score. The experimental results show that this algorithm, compared with Clauset and LWP, finds local communities of higher quality.
The paper is organized as follows. Section 2 introduces concepts and algorithms related to local community detection; Section 3 introduces the algorithm ILCDSP proposed in this paper; Section 4 verifies the performance of ILCDSP through experiments on real and simulated networks; and Section 5 gives the conclusion and discusses future work.

Problem Definition. The definition of local community detection was first proposed by Clauset [10]. Usually the problem is defined as follows: there is an undirected graph $G = (V, E)$, where $V$ represents the node set of the graph and $E$ represents the edge set of the graph. The connectivity information of only part of the nodes in the graph is known or available. As shown in Figure 2, four areas are distinguished: the known local community; the neighbor set, which is the set of nodes outside the local community that are connected with nodes in it; the edge nodes area, which is the set of nodes in the local community that are connected with nodes in the neighbor set, so that any edge node has at least one neighbor in the neighbor set; and the core nodes area, which is the rest of the local community.
The task of local community detection is to build a local community starting from a source node. During the process, an algorithm merges the nodes that meet its conditions into the local community and removes the nodes in the local community that do not meet them. Different algorithms use different conditions to select nodes, which are described in the next section.
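The areas in the problem definition can be computed directly from an adjacency structure. The following is a minimal sketch, assuming the graph is stored as a dictionary that maps each node to the set of its neighbors; the function name `local_regions` and the toy graph are hypothetical and only serve as illustration.

```python
def local_regions(adj, community):
    """Return (core, edge, neighbor_set) for a known local community.

    adj: dict mapping each node to the set of its neighbors (assumed format).
    community: iterable of nodes forming the known local community.
    """
    community = set(community)
    # Neighbor set: nodes outside the community adjacent to at least one member.
    neighbor_set = {u for v in community for u in adj[v] if u not in community}
    # Edge nodes: members with at least one neighbor outside the community.
    edge = {v for v in community if any(u not in community for u in adj[v])}
    # Core nodes: the remaining members.
    core = community - edge
    return core, edge, neighbor_set


# Toy graph: a triangle 0-1-2 with a tail 2-3-4.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
core, edge, neighbor_set = local_regions(adj, {0, 1, 2})
print(core, edge, neighbor_set)
```

Only the adjacency lists of the community members are read here, which matches the setting in which the rest of the network is unknown.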

Related Algorithms.
We have reviewed several effective approaches to exploring local community structure. These algorithms are presented below in order of publication date.

Clauset Algorithm.
To solve the problem of local community detection, Clauset [10] defined the local community modularity $R$ and proposed a fast-converging greedy algorithm to find the local community with maximum modularity.
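Local community modularity $R$ is defined as follows:

$$R = \frac{\sum_{ij} B_{ij}\,\delta(i,j)}{\sum_{ij} B_{ij}},$$

where $v_i$ and $v_j$ represent node $i$ and node $j$; $B_{ij} = 1$ if $v_i$ and $v_j$ are connected and at least one of them is an edge node of the local community, and $B_{ij} = 0$ otherwise; and $\delta(i,j) = 1$ if both $v_i$ and $v_j$ belong to the local community, and $\delta(i,j) = 0$ otherwise.
The process of the Clauset algorithm is similar to that of a web crawler. First, it starts from a source node $v$, merges node $v$ into the local community, and merges all neighbor nodes of $v$ into the neighbor set (the nodes adjacent to the local community); it then iterates the following steps.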
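Step 1. Calculate $\Delta R$ for all the nodes in the neighbor set of the local community. $\Delta R$ is the increment of modularity $R$.
Step 2. Merge the node $v_i$ with the largest modularity increment $\Delta R$ into the local community.
Step 3. Merge all the neighbor nodes of node $v_i$ into the neighbor set and then update $R$ and the edge nodes area.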
The algorithm merges the neighbor nodes that bring the largest increment of $R$ into the local community one by one until the size of the local community reaches a predetermined value. That is to say, this algorithm needs a parameter that fixes the size of the local community in advance. The clustering result is highly affected by the source node.

LWP Algorithm.
The LWP [11] algorithm is an improved algorithm. Compared with Clauset, it has definite termination conditions. The algorithm defines the community modularity $M$ as follows:

$$M = \frac{e_{\mathrm{in}}}{e_{\mathrm{out}}},$$

where $e_{\mathrm{in}}$ is the number of edges inside the subgraph and $e_{\mathrm{out}}$ is the number of edges that connect the subgraph with the rest of the graph.
Given an undirected graph $G = (V, E)$, the LWP algorithm starts from a source node and finds a subgraph with maximum $M$. If the subgraph is a community ($M > 1$), the algorithm returns the community consisting of the subgraph. Otherwise, it is considered that no community can be found starting from the source node. For a source node $v$, the LWP algorithm finds the subgraph with maximum $M$ through two main steps. First, a subgraph containing only the source node $v$ is constructed, and the neighbor nodes of $v$ are merged into the neighbor set. The incremental step and the pruning step of the algorithm are as follows.
In the incremental step, the algorithm iteratively merges nodes of the neighbor set into the subgraph until the subgraph has been extended to a certain degree. Each time, the algorithm selects the node that brings the largest increment of $M$. In the pruning step, the algorithm iteratively removes the nodes of the subgraph whose removal increases the local community modularity, until no node can be removed. This algorithm turns out to yield high recall but low accuracy.

The algorithm of [13] starts from a source node and gradually merges nodes into the community according to the boundary; here the boundary is the set of nodes whose distance to the source node is a fixed value. The effect of the algorithm depends on the choice of the source node. If the source node is a boundary node rather than a core node, the final clustering results will be very different. To overcome this problem, the authors suggest taking each node as the source node in turn and repeating the calculation to find the optimal result; however, this makes the algorithm very slow.
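To make the quantity optimized by LWP concrete, the sketch below counts the edges inside a subgraph and the edges leaving it and takes their ratio as the modularity. It assumes an adjacency dictionary of neighbor sets and treats a subgraph with no outgoing edges as having infinite modularity; the names `edge_counts` and `lwp_modularity` and the example graph are hypothetical.

```python
import math


def edge_counts(adj, subgraph):
    """Count edges inside the subgraph and edges leaving it."""
    subgraph = set(subgraph)
    # Every internal edge is seen from both endpoints, so halve the count.
    e_in = sum(1 for v in subgraph for u in adj[v] if u in subgraph) // 2
    e_out = sum(1 for v in subgraph for u in adj[v] if u not in subgraph)
    return e_in, e_out


def lwp_modularity(adj, subgraph):
    """Ratio of internal to external edges (assumption: inf if no external edge)."""
    e_in, e_out = edge_counts(adj, subgraph)
    return e_in / e_out if e_out else math.inf


# Two triangles (0-1-2 and 3-4-5) joined by the single edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(lwp_modularity(adj, {0, 1, 2}))  # one triangle: 3 internal edges, 1 external
print(lwp_modularity(adj, {2, 3}))     # the bridge only: 1 internal edge, 4 external
```

Under the $M > 1$ criterion, the triangle would be accepted as a community while the bridge pair would not.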

Improved Local Community Detection Algorithm Using Selection Probability
Most of the existing algorithms use a greedy strategy that selects the currently optimal node to join the local community, which can easily trap the algorithm in a locally optimal solution. To avoid this, this paper introduces, on the basis of the local community modularity, another criterion for selecting nodes: selection probability.

Selection Probability.
The LWP algorithm selects the subsequent node according to the largest increment of $M$.
The nodes that make the increment of $M$ greater than zero are regarded as candidate nodes. Here, we set the selection probability $P$ for the candidate nodes according to $\Delta M$, the increment of $M$:

$$P_i = \frac{\Delta M_i}{\sum_{j=1}^{n} \Delta M_j},$$

where $P_i$ is the selection probability of candidate node $v_i$, $\Delta M_i$ is the increment of $M$ obtained by merging $v_i$, and $n$ is the number of candidate nodes.
The selection probability is determined by $\Delta M$: the larger $\Delta M$ is, the larger the selection probability is; that is to say, the greater the increment of $M$, the more likely the node is to be added to the local community.
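As a small illustration, the selection probabilities can be obtained by normalizing the positive modularity increments so that they sum to one. The sketch below assumes this normalization, which reproduces the values of the worked example in Figure 3 (increments 1/6, 1/6, 1/6, and 1/5); the function name `selection_probabilities` is hypothetical.

```python
from fractions import Fraction


def selection_probabilities(delta_m):
    """Normalize positive modularity increments into selection probabilities.

    delta_m: dict mapping each node of the neighbor set to its increment of M.
    Nodes with a non-positive increment are not candidates and are dropped.
    """
    candidates = {node: d for node, d in delta_m.items() if d > 0}
    total = sum(candidates.values())
    return {node: d / total for node, d in candidates.items()}


# Increments of the four candidate nodes in the Figure 3 example.
delta_m = {1: Fraction(1, 6), 2: Fraction(1, 6), 3: Fraction(1, 6), 4: Fraction(1, 5)}
probs = selection_probabilities(delta_m)
print(probs)
```

Exact fractions are used here only so that the result can be compared with the fractions quoted in the text; floating-point increments work the same way.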

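Steps of the Algorithm.
The ILCDSP algorithm generates a random number $r$ between 0 and 1; if $r$ falls in the portion of the interval $[0, 1]$ assigned to candidate node $v_i$, whose length equals its selection probability $P_i$, node $v_i$ is merged into the local community. It is clear that a node with a greater selection probability has a greater chance of being selected.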
Steps of the algorithm are shown in Algorithm 1. The ILCDSP algorithm first merges the source node $v_0$ into the local community and then merges all of its neighbor nodes into the neighbor set (Step 1). It initializes the community modularity $M$ as zero (Step 2), goes through each node in the neighbor set, chooses the nodes that make $\Delta M$ larger than zero as candidate nodes, and, at the same time, calculates their selection probabilities (Step 3). The ILCDSP algorithm then generates a random number $r$ between 0 and 1, selects a node $v_i$ according to the node selection probabilities, and merges it into the local community. It then updates the affected sets and the modularity and repeats the above steps until no node can be merged into the local community (Steps 4-9). Finally, the algorithm returns the local community containing the source node (Step 10).
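The steps described above can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from the original implementation: the graph is an adjacency dictionary of neighbor sets, the modularity is the ratio of internal to external edges, the random choice is a roulette wheel over the normalized increments, and a candidate that would leave no external edge at all is merged immediately. The names `ilcdsp` and `roulette_pick` are hypothetical.

```python
import math
import random


def roulette_pick(gains, rng):
    """Pick a candidate with probability proportional to its modularity gain."""
    # Assumption: a node that would leave no external edges has infinite gain
    # and is merged directly.
    closing = [v for v, g in gains.items() if math.isinf(g)]
    if closing:
        return closing[0]
    r = rng.random()                      # random number r in [0, 1)
    total = sum(gains.values())
    cumulative, chosen = 0.0, None
    for v, g in gains.items():
        chosen = v
        cumulative += g / total           # g / total is the selection probability
        if r < cumulative:
            break
    return chosen                         # last node guards against rounding


def ilcdsp(adj, source, rng=None):
    """Grow a local community from `source` (sketch of the described steps)."""
    rng = rng or random.Random()
    community = {source}
    neighbor_set = set(adj[source])
    e_in, e_out = 0, len(adj[source])     # the modularity starts at zero
    while neighbor_set:
        m = e_in / e_out
        gains, moves = {}, {}
        for v in neighbor_set:
            x = sum(1 for u in adj[v] if u in community)  # edges into the community
            y = len(adj[v]) - x                           # edges to outside nodes
            new_out = e_out - x + y
            new_m = (e_in + x) / new_out if new_out else math.inf
            if new_m > m:                 # candidate: positive increment
                gains[v] = new_m - m
                moves[v] = (x, y)
        if not gains:
            break                         # no node can be merged any more
        chosen = roulette_pick(gains, rng)
        x, y = moves[chosen]
        e_in, e_out = e_in + x, e_out - x + y
        community.add(chosen)
        neighbor_set.discard(chosen)
        neighbor_set |= set(adj[chosen]) - community
    return community


# Two triangles joined by the edge 2-3; the result depends on the random draws.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(ilcdsp(adj, 0, random.Random(1)))
```

Passing a seeded `random.Random` makes a run repeatable, which is convenient because the outcome of a randomized selection otherwise varies between runs.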
Here, we give an example to illustrate the application of the algorithm, as shown in Figure 3. The source node is an edge node of its community. To find the local community in which the source node is located, the next step should select one node from the candidate nodes 1, 2, 3, and 4 to join the local community. Their increments of $M$ are $\Delta M_1 = \Delta M_2 = \Delta M_3 = 1/6$ and $\Delta M_4 = 1/5$. According to LWP, node 4, which has the largest increment of $M$, should join the local community. However, node 4 does not belong to the community of the source node. If node 4 joins the local community, the following steps may continue to gather nodes of the community to which node 4 belongs. Finally, we would get a local community composed of that neighboring community and the source node, so we could not get the community in which the source node is located. In our algorithm, by contrast, we calculate the selection probabilities for nodes 1, 2, 3, and 4. Since $\Delta M_1 + \Delta M_2 + \Delta M_3 + \Delta M_4 = 7/10$, we have $P_1 = P_2 = P_3 = 5/21$ and $P_4 = 2/7$, and one node is randomly selected according to these probabilities. Although the selection of node 4 cannot be avoided completely, one of nodes 1, 2, and 3 is selected with probability 5/7, so our algorithm reduces the chance of a misstep like node 4 and can obtain a better result.
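To improve the efficiency of the algorithm, for any node $v_i$ in the neighbor set, formula (6) [12] can be used to speed up the calculation of $\Delta M_i$:

$$\Delta M_i = \frac{x\,e_{\mathrm{out}} + (x - y)\,e_{\mathrm{in}}}{e_{\mathrm{out}}\,(e_{\mathrm{out}} - x + y)}, \qquad (6)$$

where $e_{\mathrm{in}}$ denotes the number of edges inside the local community and $e_{\mathrm{out}}$ denotes the number of edges between the local community and its neighbor set. $x$ represents the number of edges that will be added into the local community because of the agglomeration of node $v_i$, and $y$ is the number of edges that connect $v_i$ with nodes outside the local community. It is easy to see that the degree of $v_i$ equals $x + y$.
The original calculation of $\Delta M_i$ subtracts the $M$ before joining $v_i$ from the $M$ after joining $v_i$. It equals formula (6), which can be proved as follows.
Firstly, after $v_i$ joins, its $x$ edges to the local community become internal edges and its $y$ edges to outside nodes become external edges, so the modularity after joining is

$$M' = \frac{e_{\mathrm{in}} + x}{e_{\mathrm{out}} - x + y}.$$

And then

$$\Delta M_i = M' - M = \frac{e_{\mathrm{in}} + x}{e_{\mathrm{out}} - x + y} - \frac{e_{\mathrm{in}}}{e_{\mathrm{out}}} = \frac{x\,e_{\mathrm{out}} + (x - y)\,e_{\mathrm{in}}}{e_{\mathrm{out}}\,(e_{\mathrm{out}} - x + y)}.$$

Compared with the original calculation of $\Delta M$, which needs to traverse all the nodes each time, formula (6) only needs the values $x$ and $y$ of the node that will be joined to the local community.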
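The equivalence between the direct and the fast calculation can be checked numerically. The sketch below assumes that the modularity is the ratio of internal to external edges and writes the increment in one closed form that follows from that definition; the symbols `x` (edges from the new node into the community) and `y` (edges from the new node to outside nodes) and both function names are illustrative choices rather than the original notation.

```python
from fractions import Fraction


def delta_m_direct(e_in, e_out, x, y):
    """Modularity after merging the node minus modularity before merging."""
    return Fraction(e_in + x, e_out - x + y) - Fraction(e_in, e_out)


def delta_m_fast(e_in, e_out, x, y):
    """Closed form that needs only the current edge counts and x, y of the node."""
    return Fraction(x * e_out + (x - y) * e_in, e_out * (e_out - x + y))


# Example: 3 internal and 4 external edges; the new node has 2 edges into the
# community and 1 edge to an outside node.
print(delta_m_direct(3, 4, 2, 1), delta_m_fast(3, 4, 2, 1))
```

Because the closed form reuses the stored edge counts, only the adjacency list of the node being evaluated has to be read at each step.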

Experiments
This section verifies the performance of the improved algorithm ILCDSP. We compare the ILCDSP algorithm with several typical local community detection algorithms, namely Clauset R, Clauset M, and LWP. Clauset R and Clauset M are the Clauset algorithms taking $R$ and $M$, respectively, as the objective function. The experimental environment is as follows: processor, Intel(R) Core(TM) i5-2400 CPU @ 3.10 GHz; memory, 2 GB; operating system, Windows 7; programming language, C#.NET.

Datasets.
We select three real networks and the LFR benchmark network as experimental data.
(1) The LFR benchmark network [14] is the most commonly used data set in current studies of community detection. LFR benchmark networks mainly include the following parameters: the parameter of the node degree distribution; $k$, the average degree of the nodes in the network; max $k$, the largest node degree; min $c$, the number of nodes in the smallest community; max $c$, the number of nodes in the largest community; and $\mu$, the mixing parameter, which is the probability that a node is connected with nodes of external communities. The larger $\mu$ is, the more difficult community detection is. We produce four groups of LFR benchmark networks (B1, B2, B3, and B4): two of the groups share the parameters $N = 1000$, $k = 20$, and max $k = 50$, and the other two groups share the parameters $N = 5000$, $k = 20$, and max $k = 50$, where $N$ is the number of nodes. Each group contains nine simulation networks, and the detailed parameter settings are shown in Table 1. The min $c$ and max $c$ of the groups were set to either {10, 50} or {20, 100}, which represent small-community networks and big-community networks, respectively. The value of $\mu$ in each group is set from 0.1 to 0.9, which generates the nine simulation networks; these values range from weakly mixed networks to highly mixed networks.
(2) The detailed information of real network data is shown in Table 2.

Evaluation.
We use Precision, Recall, and F-Score as the evaluation criteria. They are used frequently in areas such as statistics, information retrieval, and machine learning [16][17][18]. Some other articles on community detection also use these criteria [19][20][21][22][23]. Precision is the fraction of the nodes in the detected community that are correctly classified; Recall is the ratio of the number of correctly classified nodes to the total number of nodes that should be agglomerated into the community; and F-Score is the harmonic mean of Precision and Recall. The specific formulas are as follows:

$$\mathrm{Precision} = \frac{|C_F \cap C_T|}{|C_F|}, \qquad \mathrm{Recall} = \frac{|C_F \cap C_T|}{|C_T|}, \qquad F\text{-}\mathrm{Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$

where $C_T$ is the local community to which the source node belongs in the true partition, and $C_F$ is the local community of the source node identified by the local community detection algorithm. A well-performing algorithm should achieve high Precision, Recall, and F-Score at the same time. In the experiments, every node in the network is taken as a source node to discover its local community. We average the Precision, Recall, and F-Score over all nodes to evaluate and compare the accuracy of our algorithm against the others.
Figures 4, 5, 6, and 7 show the results of the algorithms on the four groups of LFR benchmark networks (B1-B4), respectively. The horizontal axis is the value of $\mu$, which ranges from 0.1 to 0.9; the vertical axis shows the three evaluation measures of local community detection. From these figures, it can be seen that the results of Clauset R and Clauset M overlap completely. Although the forms of the community modularities $R$ and $M$ are different, the algorithms using them have the same effect. At the same time, the improved algorithm performs better than the original LWP algorithm. Although ILCDSP does not improve the value of Precision, it greatly improves the value of Recall, which makes its F-Score significantly greater than that of LWP. Since a well-performing algorithm should have high Precision, Recall, and F-Score at the same time, the algorithm proposed in this paper is superior to LWP. That is to say, our algorithm detects the structure of the local community better than the original algorithm.
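The three measures and their averaging over source nodes can be sketched as follows, assuming that the detected and the true community are given as sets of node identifiers. The function names `precision_recall_f` and `average_scores` and the sample sets are hypothetical.

```python
def precision_recall_f(found, truth):
    """Precision, Recall, and F-Score of a detected community against the truth."""
    found, truth = set(found), set(truth)
    correct = len(found & truth)          # correctly classified nodes
    precision = correct / len(found) if found else 0.0
    recall = correct / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f_score


def average_scores(pairs):
    """Average the three measures over (found, truth) pairs, one per source node."""
    scores = [precision_recall_f(found, truth) for found, truth in pairs]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))


# Detected community {1, 2, 5} against the true community {1, 2, 3, 4}.
print(precision_recall_f({1, 2, 5}, {1, 2, 3, 4}))
```

In this example two of the three detected nodes are correct and two of the four true nodes are recovered, so Precision exceeds Recall.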

Analysis of Real Network.
In our experiments, we apply each algorithm to three real networks. Table 3 shows the results. The bold figures are the maximum values of each evaluation measure on each network. The results of Clauset R and Clauset M are the same. Most of the maximum values are obtained by ILCDSP. Although the Precision of Clauset R and Clauset M on karate and polbooks is greater than that of ILCDSP, their Recall is lower, which makes the F-Score of these two algorithms lower than that of ILCDSP. Overall, ILCDSP therefore performs best. Figure 8 shows intuitively that the result of our algorithm is better than those of the other algorithms. It further shows that our algorithm can not only detect local communities of high quality in the simulated networks but can also be applied to detect community structure effectively in real networks.

Conclusions
This paper proposed an improved local community detection algorithm, ILCDSP. The algorithm first sets a selection probability for each candidate node, making the nodes with a high selection probability more likely to be selected, and then randomly selects nodes according to these probabilities. This steers the algorithm in a better search direction and improves the accuracy of local community detection. Experimental results show that ILCDSP detects the structure of the local community more effectively than the other algorithms on both real and simulated networks. Although the proposed algorithm improves the accuracy of community detection, it is not stable enough, and it remains to be further researched and improved.

Figure 2: Definition of local community.

Figure 3: An example of application of ILCDSP.

Figure 8: Comparison of different real networks.

Input: A source node $v_0$ and a network
Output: A local community containing the source node $v_0$
(1) Merge the source node $v_0$ into the local community and merge the neighbor nodes of $v_0$ into the neighbor set
(2) Set the local modularity $M = 0$
(3) Calculate $P_i$ for each node in the neighbor set that makes $\Delta M > 0$

Table 1: Information of LFR benchmark network.

Table 2: Information of real networks.

Table 3: Comparison of different real networks.