A Modularity Degree Based Heuristic Community Detection Algorithm

A community in a complex network can be seen as a subgroup of nodes that are densely connected. Discovering community structure is a basic research problem with applications in various areas, such as biology, computer science, and sociology. Existing community detection methods usually try to expand or collapse the node partitions in order to optimize a given quality function. These optimization-function-based methods share the same drawback of inefficiency. Here we propose a heuristic algorithm (the MDBH algorithm) based on network structure, which employs modularity degree as a measure function. Experiments on both synthetic benchmarks and real-world networks show that our algorithm gives accuracy competitive with previous modularity optimization methods, even though it has lower computational complexity. Furthermore, due to the use of modularity degree, our algorithm naturally improves the resolution limit in community detection.


Introduction
Many complex systems in nature and society can be thought of as networks composed of nodes and edges; the Internet, metabolic networks, food webs, neural networks, logistics and supply chain networks, industry cluster networks, and social organization networks are some examples [1][2][3]. In-depth study of the physical meaning and mathematical properties of networks has revealed that many of them share a common property called community structure: the whole network is made up of multiple groups or clusters. Connections between nodes in the same group are relatively dense, while connections between nodes in different groups are sparse [4,5]. Discovering community structure is an important way to analyze the structure, features, and functions of real networks, and it has a wide range of applications in the natural sciences, engineering, economics, and sociology.
Community detection methods usually share the metaprocedure of "trying to expand or collapse the node partitions of a given graph $G$, in order to optimize a given quality function, stopping when no increment is possible" [5]. There are some commonly known quality functions that quantify whether a set of entities is more related than expected and can thus be considered a community [7][8][9]. Modularity, introduced by Clauset et al. [8], is the most commonly used measure of community structure; the larger the value of modularity, the more accurate a partition into communities is, so it provides a way to judge how accurate a given description of the graph in terms of communities is. Because the space of possible partitions grows faster than any power of the system size, searching for the optimal (largest) modularity is a nondeterministic polynomial time hard (NP-hard) problem. For this reason, researchers have turned to heuristic strategies that restrict the search space while searching for the optimal solution [10][11][12][13]. Duch and Arenas proposed an extremal optimization (EO) strategy for modularity optimization [14]; their experimental results showed that the EO algorithm is more efficient than Newman's fast algorithm [11] and outperforms "all previous algorithms that exist in the literature."
Kumar and Jayaraman employed the Group Search Optimizer (GSO) algorithm for community detection [7]. GSO is a recent algorithm in the field of evolutionary computing. It is a process for obtaining an optimum solution in a search space, which is analogous to searching for a better clustering of a network to obtain the best Newman modularity. Kumar's approach uses optimal modularity to select candidate solutions and follows the GSO process to detect community structure. The evaluation results showed that the GSO algorithm is capable of identifying community structures in standard benchmark datasets. The community extraction method proposed by Blondel et al. [12] focuses on large networks; a mobile phone network of 2.6 million nodes and a web graph of 118 million nodes were used to demonstrate the capability of the method. These modularity-optimization-based approaches are reasonably efficient, but all of them suffer from the resolution limit, which is determined by modularity itself: modularity optimization may fail to identify modules smaller than a certain scale. A significant amount of work has been carried out to solve the problem of the resolution limit [15][16][17]. Li et al. proposed a quantitative function for community partition called modularity density [9], which is superior to modularity. Li et al.'s work also proved that modularity density is equivalent to the objective function of kernel k-means and showed that optimizing the modularity density can not only correctly identify the number of communities but also resolve detailed modules that existing approaches cannot.
Here we propose a heuristic algorithm that uses the local community structure as its basis, which makes it a competitively fast method, and uses modularity density as the measure function, which keeps it accurate. At the same time, our algorithm needs no prior conditions as input.

Definitions
So far, there is still no strict definition of community structure. It is clear that a reasonable community structure should have more connections within the same community than between different communities. That is to say, the community structure that has the most edges within communities and the fewest edges between communities tends to be reasonable. Figure 1 shows an ordinary network diagram, where $V$ is a community having $n$ ($n = 5$) nodes, $i$ is a node connected to $V$ with $i \notin V$, and $k$ equals the degree of node $i$. Define $L(i, V)$ as the number of nodes in $V$ that are directly connected to node $i$. That is, $L(i, V)$ is equal to the number of neighbors of node $i$ that belong to community $V$. Now we try to add node $i$ to community $V$ to form a new community, labeled $V'$; community $V'$ will then have $L(i, V)$ more inner edges than community $V$. Obviously, node $i$ should be added to community $V$ if $L(i, V)$ is large enough.
In order to quantitatively measure whether it is reasonable for node $i$ to join community $V$, we use modularity density $D$ as the measure function.
Given a network $G = (V, E)$, $V$ is the vertex set, $E$ is the edge set, and $A$ is the adjacency matrix of $G$. If $V_i$ and $V_j$ are two disjoint subsets of $V$, that is, $V_i \cap V_j = \emptyset$, we further define $L(V_i, V_j)$ as the number of edges that have one node in $V_i$ and the other node in $V_j$. Then $L(V_i, V_i)$ is the number of inner edges of $V_i$ and $L(V_i, \bar{V}_i)$ is the number of outer edges of $V_i$ (where $\bar{V}_i = V - V_i$), and the modularity density of community $V_i$ is defined as

$$D_{V_i} = \frac{L(V_i, V_i) - L(V_i, \bar{V}_i)}{|V_i|}. \tag{1}$$

To facilitate memory, (1) is simplified as

$$D_{V_i} = \frac{E_{\mathrm{in}}(V_i) - E_{\mathrm{out}}(V_i)}{\mathrm{Num}(V_i)}. \tag{2}$$

Here, $E_{\mathrm{in}}(V_i)$ is equal to the number of inner edges of $V_i$, $E_{\mathrm{out}}(V_i)$ is equal to the number of outer edges of $V_i$, and $\mathrm{Num}(V_i)$ is equal to the number of nodes in $V_i$. Then, the modularity density of community $V$ in Figure 1 can be expressed as $D_V = (E_{\mathrm{in}} - E_{\mathrm{out}})/n$, where $E_{\mathrm{in}}$ is the number of inner edges of community $V$, $E_{\mathrm{out}}$ is the number of outer edges of community $V$, and $n$ is the number of nodes in community $V$.
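The definition can be made concrete with a minimal Python sketch. It assumes the network is stored as a dictionary that maps each node to the set of its neighbors; the function name `modularity_density` and the toy graph are illustrative choices, not part of the original method description.

```python
def modularity_density(adj, community):
    """Return (E_in - E_out) / Num for a single community.

    adj: dict mapping each node to the set of its neighbours (undirected graph).
    community: iterable of nodes that form the community.
    """
    members = set(community)
    if not members:
        raise ValueError("community must contain at least one node")
    twice_e_in = 0  # inner edges, counted once from each endpoint
    e_out = 0       # edges with exactly one endpoint inside the community
    for u in members:
        for v in adj[u]:
            if v in members:
                twice_e_in += 1
            else:
                e_out += 1
    e_in = twice_e_in // 2
    return (e_in - e_out) / len(members)


# Hypothetical toy graph: a triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(modularity_density(adj, [0, 1, 2]))  # (3 - 1) / 3
print(modularity_density(adj, [3]))        # (0 - 1) / 1
```

In the toy graph the triangle has three inner edges and one outer edge, so its density is positive, while the single pendant node has only an outer edge and a negative density.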
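Next we derive a constraint function that lets our algorithm decide whether it is reasonable to add the specified node $i$ to community $V$.
If node $i$ is added to community $V$ to form a new community $V'$, the $L(i, V)$ edges between $i$ and $V$ become inner edges and the remaining $k - L(i, V)$ edges of $i$ become outer edges, so the inner edges, outer edges, and number of nodes of $V'$ are

$$E'_{\mathrm{in}} = E_{\mathrm{in}} + L(i, V), \qquad E'_{\mathrm{out}} = E_{\mathrm{out}} + k - 2L(i, V), \qquad n' = n + 1, \tag{3}$$

and the modularity density of $V'$ is

$$D_{V'} = \frac{E_{\mathrm{in}} - E_{\mathrm{out}} + 3L(i, V) - k}{n + 1}. \tag{4}$$

According to the theory of the $D$ value, to make $V'$ a more reasonable community than $V$, $D_{V'} > D_V$ should be satisfied; that is,

$$\frac{E_{\mathrm{in}} - E_{\mathrm{out}} + 3L(i, V) - k}{n + 1} > \frac{E_{\mathrm{in}} - E_{\mathrm{out}}}{n}. \tag{5}$$

Inequality (5) can be simplified as

$$3L(i, V) - k > \frac{E_{\mathrm{in}} - E_{\mathrm{out}}}{n}. \tag{6}$$

The key to our approach is to traverse the nodes in the network. Each node $i$ is initially treated as an independent community. For each node $i$, we first find the community, say $V$, that contains the largest number of neighbors of node $i$; we then attempt to add $i$ to $V$ and check whether the constraint of inequality (5) is met. Node $i$ is added to $V$ if the constraint is satisfied. After several iterations, the community structure of the network remains unchanged, and we obtain the final network division.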
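The constraint can be checked numerically with a short Python sketch. It assumes that joining adds $L(i, V)$ inner edges and changes the outer edges by $k - 2L(i, V)$, and it compares the direct test $D_{V'} > D_V$ with the simplified form $3L(i, V) - k > D_V$. The function names and the example numbers are hypothetical.

```python
from fractions import Fraction


def density(e_in, e_out, n):
    """Modularity density (E_in - E_out) / n of a community, as an exact fraction."""
    return Fraction(e_in - e_out, n)


def density_after_join(e_in, e_out, n, k, l):
    """Density after adding a node of degree k that has l neighbours inside."""
    new_in = e_in + l            # the l edges to the community become inner edges
    new_out = e_out + k - 2 * l  # l outer edges disappear, k - l new ones appear
    return Fraction(new_in - new_out, n + 1)


def should_join(e_in, e_out, n, k, l):
    """Simplified test: 3 * l - k must exceed the current density."""
    return 3 * l - k > density(e_in, e_out, n)


# Hypothetical community with 5 nodes, 6 inner edges, and 3 outer edges.
for k, l in [(4, 3), (4, 1)]:
    direct = density_after_join(6, 3, 5, k, l) > density(6, 3, 5)
    print(k, l, direct, should_join(6, 3, 5, k, l))
```

Exact fractions are used so that the two tests can be compared without rounding effects; for each candidate the two printed truth values agree.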

MDBH Algorithm and Analysis
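In this section, we discuss the details of our MDBH algorithm. Suppose the number of nodes of network $G$ is $n$ and the degree of node $i$ is $k(i)$. We first initialize the network as follows.
(1) Each node $i$ in the network independently forms a community $V_i$; the inner edges of $V_i$ are $E_{\mathrm{in}}(V_i) = 0$, the outer edges of $V_i$ are $E_{\mathrm{out}}(V_i) = k(i)$, and the number of nodes in $V_i$ is $\mathrm{Num}(V_i) = 1$.
(2) For each node $i$, we maintain a collection $\{L(i, V_j) \mid i, j = 1, 2, \ldots, n\}$, where $L(i, V_j)$ is the number of neighbors of $i$ in community $V_j$. If node $i$ and node $j$ are neighbors, initialize $L(i, V_j) = 1$; otherwise, initialize $L(i, V_j) = 0$.
(3) The initial number of iterations of our algorithm is $t = 0$.
Algorithm 1 shows what MDBH does during a round of search. The algorithm terminates when none of the nodes in the network changes the community it belongs to during iteration $T$, where $T$ is the final value of the iteration number.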
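A minimal Python sketch of one possible reading of this search is given below. It is not a transcription of Algorithm 1. It assumes an adjacency dictionary as input, recomputes the modularity density from scratch instead of maintaining the incremental counters, breaks ties between candidate communities by the largest sum of member degrees, and leaves a node in place when its own community already holds most of its neighbors. The function name `mdbh` and the `max_rounds` safeguard are illustrative assumptions.

```python
def mdbh(adj, max_rounds=100):
    """Sketch of an MDBH-style search; adj maps node -> set of neighbours."""
    comm = {v: v for v in adj}       # each node starts as its own community
    members = {v: {v} for v in adj}  # community label -> set of member nodes

    def density(nodes):
        e_in = sum(1 for u in nodes for w in adj[u] if w in nodes) // 2
        e_out = sum(1 for u in nodes for w in adj[u] if w not in nodes)
        return (e_in - e_out) / len(nodes)

    def degree_sum(label):
        return sum(len(adj[u]) for u in members[label])

    for _ in range(max_rounds):      # safeguard added for this sketch
        moved = 0
        for i in adj:
            # Step 1: count the neighbours of i in every adjacent community.
            counts = {}
            for w in adj[i]:
                counts[comm[w]] = counts.get(comm[w], 0) + 1
            if not counts:
                continue             # isolated node
            target = max(counts, key=lambda c: (counts[c], degree_sum(c)))
            if target == comm[i]:
                continue             # i already sits with most of its neighbours
            # Step 2: move i only if the target's modularity density increases.
            if density(members[target] | {i}) > density(members[target]):
                old = comm[i]
                members[old].discard(i)
                if not members[old]:
                    del members[old]
                members[target].add(i)
                comm[i] = target
                moved += 1
        if moved == 0:               # no node changed community: converged
            break
    return comm


# Hypothetical test graph: two triangles joined by the single edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(mdbh(adj))
```

On the two-triangle graph the sketch converges in the second round, with each triangle forming one community. Recomputing the density on every test is simpler to read than the incremental update but slower, so the sketch does not reproduce the running time of the method.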
The initialization takes $O(n)$ time to deal with $n$ nodes and $n$ communities; the operations on them are simple numeric expressions. Our algorithm needs $T$ iterations in total. In each iteration, MDBH tries to find the maximum $L(i, V_j)$ for the current node, and the number of entries $L(i, V_j)$ satisfying $L(i, V_j) \neq 0$ is $\sum_{i=1}^{n} k(i)$, so "Step 1" takes up to $O(\sum_{i=1}^{n} k(i))$ time. The time complexity of evaluating (5) is $O(n)$. The first three update operations in Step 2 are linear operations, so their time complexity is $O(3n)$. The fourth update operation is linear too: the number of a node's neighbors is the degree of the node, so it takes $O(\sum_{i=1}^{n} k(i))$. Adding these results up, we get the time complexity of MDBH, which is $O(n + T\sum_{i=1}^{n} k(i) + Tn + 3Tn + T\sum_{i=1}^{n} k(i))$ and can be simplified as $O(Tn + T\sum_{i=1}^{n} k(i))$. If we define $m$ as the number of edges of the target network, then $\sum_{i=1}^{n} k(i) = 2m$, and the time complexity of our algorithm is $O(Tn + Tm)$.
The value of $T$ is a vital factor affecting the actual speed of our algorithm: the smaller the value of $T$, the faster the MDBH algorithm is. In each iteration we try to add the current node to the community that contains the maximum number of its neighbors, and a larger community is more likely to meet this condition, so the main communities, which include most of the nodes in the network, form quickly after a small number of iterations, especially when the network has a good community structure. If the community structure of the network is so poor that it consists of a large number of small communities, then $T$ will be larger; however, detecting community structure in such networks is not meaningful. Therefore, the value of $T$ can be expected to be small enough to make our algorithm fast.

Experimental Evaluation
In order to verify the performance of the MDBH algorithm, we chose two commonly known benchmark networks: the "Zachary club network" and the "Dolphin social network." We also built our own networks: a character relationship network of the classic Chinese novel "Romance of the Three Kingdoms" and computer-generated networks with a large number of nodes and clear community structures. Experiments on these networks were conducted on a typical desktop computer with a 3.0 GHz Pentium 4 processor and 3 GB of RAM.

Zachary Club Network.
In the early 1970s, Zachary spent two years observing the friendships between members of a karate club at an American university and constructed a network of the relationships between them [18]. It consists of the 34 members of the karate club as nodes and 78 edges representing friendships. Due to a disagreement between the club's administrator and the club's instructor, the club split into two smaller ones, as shown in Figure 2.
The detailed experimental process on the Zachary club network is described below. First, we obtained the adjacency matrix of the Zachary club and initialized it. Then we began to traverse the network from the 1st node, trying to find the maximum of $L(1, V_j)$. More than one community met this condition, so the community with the largest sum of its members' degrees was chosen. For the 1st node, we got $j = 3$. We evaluated the constraint in inequality (5) and the result was −11 > −10; the inequality was not satisfied, so we went on to the 2nd node. When $j = 1$, the value of $L(2, V_j)$ was the largest and (5) was satisfied, so the 2nd node was removed from $V_2$ and added to $V_1$; after this, we followed the update process of Step 2 in Algorithm 1. We continued with the 3rd node and so on until all 34 nodes in the network had been handled, and obtained the result of the first iteration shown in Figure 3.
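The first step can be reproduced by hand under two assumptions that are not stated in the text: the 1st and 3rd nodes have degrees 16 and 10 in the standard Zachary data set, and joining a community adds $L$ inner edges while changing its outer edges by $k - 2L$. With the 3rd node as a single-node community ($E_{\mathrm{in}} = 0$, $E_{\mathrm{out}} = 10$, $n = 1$) and $L = 1$, $k = 16$ for the 1st node:

$$D_{V_3} = \frac{0 - 10}{1} = -10, \qquad D_{V_3'} = \frac{(0 + 1) - (10 + 16 - 2 \cdot 1)}{1 + 1} = \frac{-23}{2} = -11.5 .$$

Under these assumptions the new density does not exceed $-10$, so the 1st node stays where it is. The reported value of $-11$ would match $-11.5$ with the fractional part dropped, although the text does not say how the figure was rounded.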
Apart from the 24th, 25th, and 26th nodes, the nodes in the network were partitioned into two major communities, $V_{02}$ and $V_{34}$; the community structure of the Zachary club already becomes visible after the first iteration. The reason is that, during the first iteration, the larger communities are more likely to contain the maximum number of neighbors of the current node, so nodes in smaller communities tend to be merged into a larger community. Therefore, the two major communities became stronger after the first iteration. In the second iteration, the 24th, 25th, 26th, and 31st nodes were removed from their prior communities and merged into $V_{34}$ in turn. It is apparent from the network topology that the first three nodes should be merged into $V_{34}$. The 31st node has two edges connected to each of $V_{01}$ and $V_{34}$, but $V_{34}$ became stronger after the first three nodes were added to it, so the 31st node changed its community from $V_{01}$ to $V_{34}$. This adjustment shows that our algorithm can make reasonable adjustments according to changes in the current community structure, and that it can reach a more reasonable community structure even though it may have made wrong decisions in previous iterations. The 9th node was merged into $V_{34}$ in the third iteration. The 3rd node was merged into $V_{34}$ in the fourth iteration. No node changed its community in the fifth iteration, so our algorithm stopped after the fifth iteration and produced the final result shown in Figure 4. The result is completely consistent with those of well-known community detection algorithms when the ambiguity of the 3rd node is not considered.

Dolphin Social Network.
From 1994 to 2001, Lusseau studied 62 dolphins living in Doubtful Sound, New Zealand. By observing the contact between them, he built a dolphin social network [19]. If two dolphins are often together, an edge exists between them. Interestingly, this group of dolphins split into two smaller ones after the departure of a key member.
Again, the proposed algorithm was applied to the Dolphin social network, and it took a total of 5 iterations to get the final result. During the first iteration 57 nodes changed their communities, resulting in two large communities and 6 small ones. During the second, third, and fourth iterations, only 5, 3, and 1 nodes, respectively, were moved to a new community. No node changed its community in the fifth iteration, which means that our algorithm deals with very few nodes in the later iterations. The network was eventually divided into two parts, with 41 nodes and 21 nodes, respectively. The final result matches the real situation very well, as shown in Figure 5. By applying our algorithm to the well-known Zachary club and Dolphin social networks, we found that the majority of nodes are classified into the correct communities after the first iteration, so the community structure of the network quickly becomes clear, and only a few nodes need to be adjusted in the remaining iterations. Our algorithm has a good convergence rate because possible gains in modularity density are easy to compute with the above formula and because the number of communities decreases dramatically after just a few iterations, so most of the running time is concentrated in the first few iterations.

Three Kingdoms Network (TK Network).
In the first two experiments, our algorithm was shown to be capable of dividing target networks into communities as reasonably as well-known community detection methods do. In order to show that our algorithm is good enough to analyze more complex network models, we built an empirical network called the "Three Kingdoms network." The "Romance of the Three Kingdoms" is a famous classic novel describing the story of the Eastern Han Dynasty of China. The characters in this novel maintain a hierarchical community structure that is consistent with historical knowledge. Inspired by Ravasz and Barabási's work [20], we built the TK network from the relationships between the characters of the first five chapters of the novel, with 55 nodes and 77 edges. Each node in the network represents a character, and if two characters in the novel have a direct dialogue, an edge exists between them.
First, we applied Newman's fast algorithm, a typical method using modularity $Q$ [8], to the Three Kingdoms network. Figure 6 shows the result at the maximum value of $Q$, which equals 0.35.
The network was divided into 27 small communities by Newman's fast algorithm. Communities in Figure 6 are labeled with different colors or shapes, and every white rectangular node represents a community by itself. The result shows that the fast algorithm found too many groups, which does not match the actual situation of that period in the novel. The 3rd, 4th, and 5th nodes are well-known brothers, but they ended up in three different groups. The 25th node ought to be a liegeman of the 15th, but they were also placed in separate groups.
According to the common understanding of the history of the Three Kingdoms period, Newman's fast algorithm did not give a satisfactory result on the target network.
Figure 7 shows the result of our algorithm. All six main groups were identified by our algorithm after 4 iterations. They are cliques led by the 3rd, 11th, 20th, 18th, 15th, and 22nd nodes, all of which are influential forces in the early era of the Three Kingdoms. Almost all nodes were assigned to the correct groups by our algorithm. Some small communities in Figure 6 were merged into the six main groups by the judgment of modularity density in (4). The division result in Figure 7 is well in line with the common understanding of the relationships between characters in the novel. We can therefore say that our algorithm performed better than Newman's well-known fast algorithm on the real Three Kingdoms network.

Computer-Generated Network.
From the complexity analysis, we know that the number of iterations $T$ is a decisive factor affecting the actual speed of our algorithm. The value of $T$ varies from one network to another, and we can hardly obtain it directly by analysis or calculation, so we try to find the behavior of $T$ through experiments on computer-generated networks.
The generated networks used here consist of three subnetworks, and the total numbers of nodes and edges are specified. There is only one edge linking different subnetworks. Nearly one third of the total edges lie within each of the three subnetworks, and they are randomly connected. Here we specify the number of nodes $n$ in the generated networks as 600, 1200, 2400, 4800, 9600, 19200, 38400, 76800, 153600, 307200, and 614400, and we specify the number of edges as $m = n \log_2 n$ and $m = 3n$, respectively.
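A minimal Python sketch of such a generator is shown below. The exact construction is not specified in the text, so several details are assumptions: the three subnetworks have equal size, inner edges are drawn uniformly at random without duplicates, and one bridging edge is placed between consecutive subnetworks. The function name `generate_network` and its parameters are hypothetical.

```python
import math
import random


def generate_network(n, m, seed=0):
    """Return an edge set on n nodes split into three equal subnetworks.

    About m / 3 random edges are placed inside each subnetwork, and one
    edge links each pair of consecutive subnetworks (an assumption).
    """
    rng = random.Random(seed)
    size = n // 3
    per_block = m // 3
    if per_block > size * (size - 1) // 2:
        raise ValueError("too many edges requested for the subnetwork size")
    blocks = [range(b * size, (b + 1) * size) for b in range(3)]
    edges = set()
    for block in blocks:
        target = len(edges) + per_block
        while len(edges) < target:
            u, v = rng.sample(block, 2)        # two distinct nodes in the block
            edges.add((min(u, v), max(u, v)))  # store undirected edges once
    for a, b in ((0, 1), (1, 2)):
        edges.add((blocks[a][0], blocks[b][0]))  # single bridging edge
    return edges


n = 600
m = round(n * math.log2(n))
edges = generate_network(n, m)
print(len(edges))
```

Storing each edge as an ordered pair in a set removes duplicates, so every subnetwork receives exactly `m // 3` distinct inner edges.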
Table 1 summarizes the numerical results of the experiments on networks with 600 to 9600 nodes ($m = n \log_2 n$). For each iteration and each $n$, the table displays the number of nodes that moved from their former community to another during that iteration. It took only 6 iterations for our algorithm to accomplish community detection. Figure 8 shows the partition result of the network with 4800 nodes and 56278 edges. It took only 15 seconds for our algorithm to partition this network.
In these experiments, the community structure of the network with 9600 nodes was expected to be clearer than that of the network with 600 nodes, and the results show that the former network takes fewer iterations than the latter, even though it has a larger scale. From this point of view, the MDBH algorithm performs well in fast community discovery for networks with clear community structure.
The statistics in Table 2 show the number of nodes that were removed from their original communities and reassigned to other communities in each iteration, where $m = 3n$. As the number of nodes increased, the community structure of the networks in this experiment became unclear. When the number of nodes exceeded 2400, the networks were divided into a large number of small communities, sometimes more than a thousand, but it took at most 20 iterations for our algorithm to converge, even though the network may have nearly ten thousand nodes and a very unclear community structure.
The statistics in Tables 1 and 2 both show that the number of iterations $T$ in our algorithm depends more on the community structure than on the network size $n$. A network with a clear community structure takes only a small number of iterations, which gives fast community detection. Even a network with an unclear structure requires a number of iterations $T$ that is far less than $n$.