An Autonomous Divisive Algorithm for Community Detection Based on Weak Link and Link-Break Strategy

Divisive algorithms are widely used for community detection. A common strategy of divisive algorithms is to remove the external links that connect different communities so that the communities become disconnected from each other. Divisive algorithms have been investigated for several decades, but some challenges remain unsolved: (1) how to efficiently identify external links, (2) how to efficiently remove external links, and (3) how to end a divisive algorithm without the help of predefined parameters or community definitions. To overcome these challenges, we introduce the concepts of the weak link and autonomous division. The implementation of the proposed divisive algorithm adopts a new link-break strategy similar to a tug-of-war contest, where communities act as contestants and weak links act as breakable ropes. Empirical evaluations on artificial and real-world networks show that the proposed algorithm achieves a better accuracy-efficiency trade-off than some of the latest divisive algorithms.


Introduction
The study of networks is now one of the most active interdisciplinary research fields [1,2]. In computer science and sociology, complex systems are abstracted as networks or graphs. The basic components of a network are nodes and links: nodes represent entities of interest, and links represent associations among entities. Community structure is one of the most important properties of complex systems, and community detection is an effective approach to studying this property. The goal of detecting community structure is to obtain a partition in which the links between nodes within the same community are dense, while the links to nodes outside the community are sparse [3][4][5][6][7].
Many community detection algorithms have been proposed [1,2], such as divisive algorithms [8][9][10][11], clustering algorithms [5][12][13][14][15], modularity optimization algorithms [16][17][18][19][20], and label propagation algorithms [21][22][23][24]. This paper focuses on divisive algorithms, which separate communities by detecting and removing links. Girvan and Newman [25] proposed an influential algorithm based on betweenness, which can identify external links [10]. However, betweenness is a global centrality index that is time-consuming to calculate, and each iteration of the algorithm removes only one link from the network. To improve the efficiency of divisive algorithms, Radicchi et al. [9] proposed the edge-clustering coefficient, which is a local centrality index. Based on the edge-clustering coefficient, their algorithm can remove multiple links from the network at each iteration. However, the result of the algorithm is a mass of trivial partitions. To get a trade-off between accuracy and efficiency, Yang et al. [11] proposed an algorithm based on closed walks. However, the termination of that algorithm depends on the quality function modularity [16,26,27].
This paper focuses on three challenges: (1) how to detect external links efficiently, (2) how to remove external links efficiently, and (3) how to end a divisive algorithm without the help of predefined parameters or community definitions. If communities can distinguish between internal and external links, then they can remove external links, keep internal links, and thereby define themselves. Based on this idea, we present the concepts of the weak link and autonomous division. The implementation of the autonomous divisive (AD) algorithm adopts a new link-break strategy similar to a tug-of-war contest.
The rest of the paper is organized as follows. Section 2 reviews related work on divisive algorithms. Section 3 introduces the proposed definitions and algorithm. We test our algorithm and compare it with other divisive algorithms in Section 4. Section 5 concludes our study.

Related Works
2.1. Betweenness (GN) Algorithm. Girvan and Newman [25] proposed the GN algorithm. In their work, they proposed betweenness, a measure that favors the links that are most "between" communities. Each iteration of GN removes the link with the highest betweenness and then recalculates the betweenness of all the links affected by the removal. In further work, they considered alternative definitions of betweenness. Experimental results showed that the algorithm based on shortest path betweenness performs best [10].
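The removal step described above can be sketched in a few lines. This is a minimal illustration and not the reference implementation of [25]: the function names `edge_betweenness` and `gn_step` and the adjacency-dictionary representation are assumptions made for this example, and the betweenness is computed with a standard breadth-first accumulation for unweighted, undirected graphs.

```python
from collections import deque


def edge_betweenness(adj):
    """Shortest-path betweenness of every link in an unweighted, undirected graph.

    adj maps each node to the set of its neighbors (hypothetical representation).
    """
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist, sigma, preds = {s: 0}, {s: 1.0}, {s: []}
        order, queue = [], deque([s])
        while queue:  # breadth-first search from s
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v], sigma[v], preds[v] = dist[u] + 1, 0.0, []
                    queue.append(v)
                if dist[v] == dist[u] + 1:  # u precedes v on a shortest path
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        delta = {v: 0.0 for v in order}
        for v in reversed(order):  # accumulate path shares from the leaves up
            for u in preds[v]:
                share = sigma[u] / sigma[v] * (1.0 + delta[v])
                bet[frozenset((u, v))] += share
                delta[u] += share
    return {e: b / 2.0 for e, b in bet.items()}  # each node pair was counted twice


def gn_step(adj):
    """Remove the single link with the highest betweenness and return it."""
    bet = edge_betweenness(adj)
    u, v = max(bet, key=bet.get)
    adj[u].discard(v)
    adj[v].discard(u)
    return frozenset((u, v))


# Two triangles joined by the bridge c-d (hypothetical toy graph).
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
removed = gn_step(graph)  # the bridge carries all 9 inter-triangle shortest paths
```

Because every call to `gn_step` recomputes the betweenness of all links and removes only one of them, the sketch also makes the efficiency problem noted in the Introduction visible.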

2.2. Distance Dissimilarity (DD) Algorithm. Zhou [8] proposed the DD algorithm to quantify the differences between communities. Zhou introduced the dissimilarity index to measure the possibility that two adjacent nodes belong to the same community. Zhou also introduced a resolution threshold value known as the dissimilarity threshold. At each iteration of DD, the value of the dissimilarity threshold decreases differentially. Based on the dissimilarity threshold, DD can remove multiple links at each iteration and obtain hierarchically organized communities characterized by upper and lower dissimilarity thresholds [8].

2.3. Information Centrality (IC) Algorithm. Fortunato et al. [28] proposed the IC algorithm based on information centrality, defined as the relative decrement of network efficiency caused by the removal of a link. IC expects a link located between communities to have high information centrality and a link located within a community to have low information centrality [28]. Each iteration of IC accomplishes two tasks: calculating the information centrality of each link and removing the link with the highest information centrality. Experimental results showed that IC is effective at discovering community structures when the communities are cohesively connected with each other [28].
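As a minimal sketch of the quantity IC ranks links by, the code below computes the relative drop in efficiency when one link is removed. It assumes the usual definition of network efficiency as the mean inverse shortest-path distance over all ordered node pairs; the function names and the adjacency-dictionary representation are hypothetical and not taken from [28].

```python
from collections import deque


def efficiency(adj):
    """Mean of 1/d(u, v) over all ordered node pairs; unreachable pairs add 0."""
    n = len(adj)
    if n < 2:
        return 0.0
    total = 0.0
    for s in adj:
        dist, queue = {s: 0}, deque([s])
        while queue:  # breadth-first search gives hop distances from s
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(1.0 / d for d in dist.values() if d > 0)
    return total / (n * (n - 1))


def information_centrality(adj, u, v):
    """Relative decrement of efficiency caused by removing the link (u, v).

    Assumes the graph has at least one link, so the efficiency is nonzero.
    """
    before = efficiency(adj)
    pruned = {x: set(nbrs) for x, nbrs in adj.items()}  # copy, then cut the link
    pruned[u].discard(v)
    pruned[v].discard(u)
    return (before - efficiency(pruned)) / before


# Path a-b-c: removing a-b isolates node a.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
ic_ab = information_centrality(path, "a", "b")
```

Each evaluation needs a full all-pairs distance computation, which is consistent with IC being time-consuming on large networks.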
2.4. Edge-Clustering Coefficient (CD) Algorithm. Radicchi et al. [9] proposed the CD algorithm to solve two problems. The first problem is the quantitative definition of community, and the second is the time-consuming nature of divisive algorithms. To solve the first problem, they introduced two alternative quantitative definitions of community. To solve the second problem, they suggested a local centrality index, the edge-clustering coefficient. Based on the edge-clustering coefficient, CD can remove multiple links at each iteration.
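For illustration, the order-3 edge-clustering coefficient can be written as the number of triangles containing a link, plus one, divided by the largest number of triangles the link could belong to given the degrees of its endpoints. The sketch below follows that form; the function name and graph representation are assumptions for this example, and the order-4 variant is not shown.

```python
def edge_clustering_coefficient(adj, u, v):
    """Order-3 edge-clustering coefficient of the link (u, v).

    adj maps each node to the set of its neighbors (hypothetical representation).
    """
    triangles = len(adj[u] & adj[v])                  # common neighbors close a triangle
    possible = min(len(adj[u]) - 1, len(adj[v]) - 1)  # upper bound given the degrees
    if possible == 0:
        return float("inf")                           # a pendant link is never removed
    return (triangles + 1) / possible


# Two triangles joined by the bridge c-d (hypothetical toy graph).
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
bridge = edge_clustering_coefficient(graph, "c", "d")    # 0.5
internal = edge_clustering_coefficient(graph, "a", "b")  # 2.0
```

Links with a low coefficient, such as the bridge here, are the candidates for removal. Because the coefficient only inspects the neighborhoods of the two endpoints, it is a local index.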

2.5. Closed Walks (CW) Algorithm. Yang et al. [11] proposed the CW algorithm and introduced closed walks as a local centrality index. CW considers closed walks of orders 3 and 4 based on three pieces of evidence. The first comes from statistical data: in complex networks, the proportion of links that participate in closed walks of orders 3 and 4 reaches ninety percent [11]. The second comes from the three degrees of influence property known from sociology [29]. The third comes from the property that information usually propagates along paths without repeated nodes. Experimental results showed that CW is an effective way to solve the double peak structure problem [11].

An Autonomous Divisive Algorithm for Community Detection
3.1. Motivation. In real-world networks, it is often easier to discriminate between internal links and external links than to recognize overlapping nodes [1]. Defining communities as sets of links rather than nodes may be a promising strategy to analyze networks with overlapping communities [30,31].
Based on this idea, many community detection algorithms [32][33][34][35][36] aim to find the differences in the property of links to extract high-quality community structures from networks.
Based on network topology information, this paper discusses the difference between the properties of internal links and external links. We introduce the concept of the weak link to locate external links. In addition, we introduce a new link-break strategy and an autonomous division, so that the proposed divisive algorithm is free from parameters, nontopological information, and definitions of community.

3.2. Definition of Weak Link. Many real-world complex systems can be represented as a graph $G = (V, E)$, where $V$ is the set of nodes, $E$ is the set of links, $|V| = n$, and $|E| = m$. Most community detection algorithms are based on the notion that a community should have more internal connections than external connections [1,2]. This notion captures the difference in density between internal links and external links. However, more properties of links are needed to make divisive algorithms free from parameters, nontopological information, and community definitions.

First, to tell the difference between the properties of internal links and external links, we investigate, as a baseline, the expected contribution of a node to its neighbors for spreading information. If a node can only get its neighbors' information, then the node will expect that its neighbors' contribution for spreading information is uniform. We define the expected contribution that node $i$ makes to its neighbor $j$ for spreading information as

$$EC_{i \to j} = \frac{1}{k_j},$$

where $k_j$ is the degree of node $j$.

Second, we investigate the property of internal links. In a community, core members, hub members, and outlier members play different roles in spreading information [37]. Core members contribute greatly to spreading information inside communities; hub members serve as hubs for spreading information both inside and outside communities; outlier members prefer to receive information rather than send information. In a community, from core members to outlier members, the node's contribution for spreading information declines. Therefore, if node $i$ and node $j$ are the two endpoints of an internal link and $RC_{i \to j}$ is the real contribution of node $i$ to its neighbor $j$ for spreading information, we expect that […].

Lastly, we investigate the property of external links. The biggest difference between the properties of internal links and external links is that external links connect different communities. As the two endpoints of an external link play an important role in spreading information between communities, we expect that both endpoints have a real contribution that is greater than the expected contribution. Hence, if node $i$ and node $j$ are the two endpoints of an external link, we expect that $RC_{i \to j} > EC_{i \to j}$ and $RC_{j \to i} > EC_{j \to i}$. We define the weak link as follows.

Definition 1 (weak link). A link $e$ with two endpoints $i$ and $j$ is a weak link if $RC_{i \to j} > EC_{i \to j}$ and $RC_{j \to i} > EC_{j \to i}$.

3.3. Determination of Weak Link. To determine whether a link is a weak link, it is essential to quantify the real contribution of the two endpoints of the link for spreading information. Thus, we investigate the structure of the shortest path tree of each endpoint and introduce the shortest path coverage as a measure. We use the shortest path coverage to estimate whether a node is at the edge of a potential community based on the following observations. There are three subgraphs in Figure 1. The graphs in Figures 1(b) and 1(c) are isomorphic to the graph in Figure 1(a). In Figures 1(b) and 1(c), the solid lines show the shortest path trees of two different source nodes. If we consider Figure 1(a) as a community, then the source node of Figure 1(b) is a core member. Figure 1(a) contains eight links, while the shortest path trees in Figures 1(b) and 1(c) contain four links and six links, respectively. The remaining four links in Figure 1(b) and two links in Figure 1(c), drawn as dashed lines, make no contribution to the shortest path tree. In addition, the length of the shortest paths from the source node is 4 in Figure 1(b) and 6 in Figure 1(c). From Figure 1, we can conclude that, in a community, a core member reaches the other members more quickly than a less important member, and the shortest path tree of a core member is shallower than that of a less important member.
To calculate the shortest path coverage, we first calculate the end-frequency and the arrival-frequency. End-frequency, arrival-frequency, and the shortest path coverage are defined in Definitions 2, 3, and 4. Examples of the calculation of the three concepts are shown in Figure 2.
Definition 2 (end-frequency). In the shortest path tree, the end-frequency of node $v$ is the number of distinct shortest paths that start from source node $s$ and end at node $v$. End-frequency is written as $EF_v$.
Definition 3 (arrival-frequency). In the shortest path tree, the arrival-frequency of node $v$ is the number of distinct shortest paths that start from source node $s$ and arrive at node $v$. Arrival-frequency is written as $AF_v$.
Definition 4 (shortest path coverage). In the shortest path tree, suppose that node $v$ is a neighbor of source node $s$; the shortest path coverage of node $v$ is the ratio of the arrival-frequency of node $v$ to the sum of the end-frequencies of all the nodes reachable from source node $s$. The shortest path coverage is written as $SPC_v$.
The calculation of end-frequency is a top-down process using breadth-first search in time $O(m)$. We show an example for calculating the end-frequency in Figure 2(a). The end-frequency of the source node is 1. In the shortest path tree, the end-frequency of a node is the sum of the end-frequencies of all its parent nodes. For example, in Figure 2(a), there is one shortest path from the source node to node 1 and one shortest path from the source node to node 2, so the end-frequency of node 3 is 1 + 1 = 2. The end-frequency is formulated as

$$EF_{child} = \sum_{parent \in Parents} EF_{parent},$$

where "Parents" is the parent node set of node child, "parent" is a node in "Parents," $EF_{child}$ is the end-frequency of node child, and $EF_{parent}$ is the end-frequency of node parent.

The calculation of arrival-frequency is a bottom-up process in time $O(m)$. We show an example for calculating the arrival-frequency in Figure 2(b). The arrival-frequency of a leaf node is its end-frequency. In the shortest path tree, the arrival-frequency of a node is its end-frequency plus the sum of its contributions to the arrival-frequency of its child nodes. In Figure 2(a), the end-frequency of nodes 2, 3, and 4 is 1, 2, and 1, respectively. In Figure 2(b), the arrival-frequency of nodes 3 and 4 is 4 and 3. The contribution of node 2 to the arrival-frequency of nodes 3 and 4 is 4 × 1/2 = 2 and 3 × 1/1 = 3. Hence, in Figure 2(b), the arrival-frequency of node 2 is 1 + 2 + 3 = 6. The arrival-frequency is formulated as

$$AF_{parent} = EF_{parent} + \sum_{child \in Children} AF_{child} \cdot \frac{EF_{parent}}{EF_{child}},$$

where "Children" is the child node set of node parent, "child" is a node in "Children," $EF_{parent}$ is the end-frequency of node parent, $EF_{child}$ is the end-frequency of node child, $AF_{parent}$ is the arrival-frequency of node parent, and $AF_{child}$ is the arrival-frequency of node child.

Given the arrival-frequencies, the shortest path coverage can be calculated in time $O(1)$. We show an example for calculating the shortest path coverage in Figure 2(c). For example, in Figure 2(b), the arrival-frequency of nodes 1 and 2 is 3 and 6; then, in Figure 2(c), the shortest path coverage of nodes 1 and 2 is 3/(3 + 6) = 1/3 and 6/(3 + 6) = 2/3. The real contribution of node $i$ to its neighbor $j$ for spreading information is given as

$$RC_{i \to j} = SPC_i = \frac{AF_i}{\sum_{n \in Neighbors} AF_n},$$

where "Neighbors" is the neighbor set of node $j$, $n$ is a node in "Neighbors," $AF_n$ is the arrival-frequency of node $n$ in the shortest path tree of source node $j$, and $SPC_i$ is the shortest path coverage of $i$.
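The two passes described above can be sketched as follows. This is an illustrative reading of Definitions 2 to 4, not the authors' implementation: the function name `shortest_path_coverage`, the adjacency-dictionary representation, and the example graph are assumptions. The graph is a hypothetical one chosen so that it reproduces the numbers quoted from Figure 2; the actual figure may differ.

```python
from collections import deque
from fractions import Fraction


def shortest_path_coverage(adj, source):
    """Return (end_freq, arrival_freq, coverage) for the tree rooted at source."""
    depth = {source: 0}
    end_freq = {source: 1}
    parents = {source: []}
    order = []
    queue = deque([source])
    while queue:  # top-down pass: end-frequency via breadth-first search
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            if v not in depth:
                depth[v] = depth[u] + 1
                end_freq[v] = 0
                parents[v] = []
                queue.append(v)
            if depth[v] == depth[u] + 1:  # u is a parent of v
                end_freq[v] += end_freq[u]
                parents[v].append(u)
    arrival = {v: Fraction(end_freq[v]) for v in order}
    for v in reversed(order):  # bottom-up pass: arrival-frequency
        for p in parents[v]:
            if p != source:
                arrival[p] += arrival[v] * Fraction(end_freq[p], end_freq[v])
    total = sum(arrival[v] for v in adj[source])
    coverage = {v: arrival[v] / total for v in adj[source]}
    return end_freq, arrival, coverage


# Hypothetical graph consistent with the values quoted from Figure 2.
adj = {
    "s": {1, 2}, 1: {"s", 3}, 2: {"s", 3, 4}, 3: {1, 2, 5},
    4: {2, 6, 7}, 5: {3}, 6: {4}, 7: {4},
}
end_freq, arrival, cov = shortest_path_coverage(adj, "s")
print(end_freq[3], arrival[2], cov[1], cov[2])  # prints: 2 6 1/3 2/3
```

Exact fractions are used so that the comparison with the expected contribution in Definition 1 is not affected by rounding. In this example the sum of the neighbors' arrival-frequencies (3 + 6 = 9) equals the sum of the end-frequencies of all nodes reachable from the source, which matches the denominator in Definition 4.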

3.4. Autonomous Division and Link-Break Strategy. As shown in Section 2, several advanced algorithms have been proposed to detect communities in networks, but they all have certain limitations. For example, GN [25] and IC [28] are time-consuming on large-scale networks; DD [8] depends on some parameters; CD [9] and CW [11] depend on the order of cyclic structures. Besides, all these algorithms share the limitation that their output depends on a quality function or a community definition. We propose the link-break strategy and autonomous division to overcome these limitations.
To overcome the limitation on efficiency, a link-break strategy should be able to detect and remove multiple links at each iteration. To overcome the limitation on parameters and nontopological information, an autonomous division should take full advantage of the topology of the network. To overcome the limitation on quality function and community definition, an autonomous division should be able to terminate the algorithm when a satisfactory solution is reached.
The proposed link-break strategy is designed similarly to a tug-of-war contest.In the contest, communities act as contestants and links act as ropes.Weak links play the role of breakable ropes.When the force exerted on a breakable rope exceeds the rope's limit, the rope breaks.Then, the force exerted on the other ropes changes and other breakable ropes will continue to break.This process will repeat until there are no breakable ropes available in the system.Lastly, different communities get disconnected from each other.
Based on the weak link, autonomous division is easy to carry out.First, the concept of the weak link is proposed based on the topological properties of networks with a community structure.Second, based on the weak link, an algorithm has the ability to detect and remove multiple links at each iteration.

3.5. The Proposed Algorithm. Based on the concepts of the weak link and autonomous division, the proposed algorithm repeatedly detects and removes weak links until no weak links are left in the network. We show the determination of the weak link in Algorithm 1. We show the AD algorithm in Algorithm 2.
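The detect-and-remove loop described above can be sketched as follows. Since Algorithms 1 and 2 are not reproduced here, this is only one possible reading under stated assumptions: the real contribution is taken to be the shortest path coverage, the expected contribution is taken to be one over the degree, and all weak links found in a pass are removed together. The function names `coverage`, `weak_links`, and `autonomous_division` are hypothetical.

```python
from collections import deque
from fractions import Fraction


def coverage(adj, source):
    """Shortest path coverage of each neighbor of source (Definition 4)."""
    depth, end_freq, parents = {source: 0}, {source: 1}, {source: []}
    order, queue = [], deque([source])
    while queue:  # top-down pass: end-frequency
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            if v not in depth:
                depth[v], end_freq[v], parents[v] = depth[u] + 1, 0, []
                queue.append(v)
            if depth[v] == depth[u] + 1:
                end_freq[v] += end_freq[u]
                parents[v].append(u)
    arrival = {v: Fraction(end_freq[v]) for v in order}
    for v in reversed(order):  # bottom-up pass: arrival-frequency
        for p in parents[v]:
            arrival[p] += arrival[v] * Fraction(end_freq[p], end_freq[v])
    total = sum(arrival[v] for v in adj[source])
    return {v: arrival[v] / total for v in adj[source]}


def weak_links(adj):
    """Links for which both endpoints exceed the uniform expectation (Definition 1)."""
    cov = {u: coverage(adj, u) for u in adj}
    weak = set()
    for u in adj:
        for v in adj[u]:
            if (cov[u][v] > Fraction(1, len(adj[u]))
                    and cov[v][u] > Fraction(1, len(adj[v]))):
                weak.add(frozenset((u, v)))
    return weak


def autonomous_division(adj):
    """Remove weak links until none remain; return (communities, iterations)."""
    adj = {u: set(nbrs) for u, nbrs in adj.items()}  # work on a copy
    iterations = 0
    while True:
        weak = weak_links(adj)
        if not weak:  # no breakable rope left: the division ends by itself
            break
        iterations += 1
        for u, v in weak:  # assumption: all weak links of a pass break together
            adj[u].discard(v)
            adj[v].discard(u)
    seen, communities = set(), []
    for s in adj:  # the connected components are the detected communities
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adj[u] - comp)
        seen |= comp
        communities.append(comp)
    return communities, iterations


# Two triangles joined by the bridge c-d (hypothetical toy graph).
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
communities, iterations = autonomous_division(graph)
```

On this toy graph, only the bridge satisfies Definition 1 in the first pass; after it is removed, every link inside a triangle has a coverage equal to, but not greater than, the expected contribution, so the loop stops without any threshold or quality function.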

3.6. Time Complexity Analysis. Suppose the AD algorithm works on a network with $n$ nodes and $m$ links. Based on the analysis in Section 3.3, at each iteration of the AD algorithm, the time complexity of the calculation of the shortest path coverage is $O(mn)$. Suppose that the number of potential weak links is $m_{weak}$ and the number of iterations is $t$. Because multiple weak links can be removed from the network at each iteration of Algorithm 2, according to steps (5) to (8), it can be inferred that $0 \le t \le m_{weak}$. In most scale-free networks, $t \ll m_{weak} \ll m$, so the time complexity of the AD algorithm is $O(mn)$. In a sparse graph with an obvious community structure, the time complexity of the AD algorithm is $O(n^2)$. We list the time complexity of the AD algorithm and the other five divisive algorithms mentioned in Section 2 in Table 1.

Experiments and Results
In this section, the effectiveness of the AD algorithm is compared with that of the other five divisive algorithms mentioned in Section 2 on both artificial and real-world networks. All the experiments are conducted on a computer with an Intel(R) Core(TM) i3 CPU at 2.66 GHz and 2 GB of RAM.

4.1. Evaluation Criteria

4.1.1. NMI. The normalized mutual information (NMI) is a similarity measure proposed by Danon et al. [38]. NMI is based on defining a confusion matrix $\mathbf{N}$, where the rows represent real communities and the columns represent detected communities. $N_{ij}$ is the element of $\mathbf{N}$ that represents the number of nodes that belong to both real community $i$ and detected community $j$. $N_{i.}$ is the sum of the elements in row $i$, and $N_{.j}$ is the sum of the elements in column $j$. Based on information theory, a measure of similarity between the partitions is then

$$NMI(A, B) = \frac{-2 \sum_{i=1}^{C_A} \sum_{j=1}^{C_B} N_{ij} \log\left(\frac{N_{ij} N}{N_{i.} N_{.j}}\right)}{\sum_{i=1}^{C_A} N_{i.} \log\left(\frac{N_{i.}}{N}\right) + \sum_{j=1}^{C_B} N_{.j} \log\left(\frac{N_{.j}}{N}\right)},$$

where $NMI(A, B)$ is the normalized mutual information, $A$ represents the real partition, $B$ represents the found partition, $C_A$ is the number of real communities in $A$, $C_B$ is the number of detected communities in $B$, and $N$ is the total number of nodes. If the detected communities are identical to the real communities, then $NMI(A, B) = 1$. If the detected communities are totally independent of the real communities, then $NMI(A, B) = 0$.

4.1.2. Modularity. Girvan and Newman [10,25] proposed modularity ($Q$), which is defined as

$$Q = \sum_{c=1}^{k} \left[ \frac{l_c}{m} - \left( \frac{d_c}{2m} \right)^2 \right],$$

where $k$ is the number of detected communities, $c$ is the ID of a community, $l_c$ is the number of internal links of $c$, $d_c$ is the sum of the degrees of the nodes within $c$, and $m$ is the number of links in the network. This quality function measures the fraction of the links in the network that connect nodes of the same type minus the expected value of the same quantity in a network with the same community divisions but random connections between the nodes [10]. $Q = 0$ indicates that the number of links within the communities is no better than random. $Q = 1$ indicates a network with strong community structure.
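Both measures are short to compute. The sketch below is an illustration under stated assumptions rather than the evaluation code used in the experiments: partitions are given as label lists (for NMI) or as lists of node sets (for modularity), the function names are hypothetical, and NMI is set to 1 by convention when both partitions consist of a single community, a case in which the denominator vanishes.

```python
import math
from collections import Counter


def nmi(real, found):
    """Normalized mutual information between two labelings of the same nodes."""
    n = len(real)
    joint = Counter(zip(real, found))  # confusion matrix entries N_ij
    rows = Counter(real)               # row sums N_i.
    cols = Counter(found)              # column sums N_.j
    num = -2.0 * sum(nij * math.log(nij * n / (rows[i] * cols[j]))
                     for (i, j), nij in joint.items())
    den = (sum(c * math.log(c / n) for c in rows.values())
           + sum(c * math.log(c / n) for c in cols.values()))
    return num / den if den != 0 else 1.0  # convention for two trivial partitions


def modularity(adj, communities):
    """Q as the sum over communities of l_c / m - (d_c / (2m))^2."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    q = 0.0
    for comm in communities:
        l_c = sum(1 for u in comm for v in adj[u] if v in comm) / 2  # internal links
        d_c = sum(len(adj[u]) for u in comm)                         # total degree
        q += l_c / m - (d_c / (2 * m)) ** 2
    return q


# Two triangles joined by the bridge c-d (hypothetical toy graph).
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
q = modularity(graph, [{"a", "b", "c"}, {"d", "e", "f"}])  # 5/14
same = nmi([0, 0, 1, 1], [0, 0, 1, 1])                     # identical partitions
indep = nmi([0, 0, 1, 1], [0, 1, 0, 1])                    # independent partitions
```

The example also illustrates why the two measures are used side by side: NMI needs the real community labels, whereas $Q$ is computed from the network and the detected partition alone.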

4.1.3. I-Measure. In this paper, we use $I$ to evaluate the division efficiency of the algorithms:

$I$ = the number of iterations of the algorithm. (7)

4.2. Data Sets

4.2.1. Artificial Networks. The Lancichinetti-Fortunato-Radicchi (LFR) benchmark [39] produces networks with properties close to those of real-world networks. We use the LFR benchmark networks to test the algorithms. Some important parameters of the benchmark networks are given in Table 2. In Table 2, $n$ denotes the number of nodes, $k$ denotes the mean degree of the network, $\max k$ denotes the maximum degree of a node, $\min c$ denotes the minimum size of a community, $\max c$ denotes the maximum size of a community, and $\mu$ denotes the mixing parameter. For LFR$_\mu$, $\mu$ ranges from 0.1 to 0.8 with a step of 0.1. For LFR$_k$, $k$ ranges from 4 to 10 with a step of 1.

4.2.2. Real-World Networks. The network of the karate club (Karate) is a network of friendships between the 34 members of a karate club at a US university, described by Zachary [40] in 1977. Zachary identified two communities of friendship in the network, as shown in Figure 3. The network of bottlenose dolphins (Dolphins) is an undirected social network of frequent associations between 62 dolphins in a community living off Doubtful Sound, compiled by Lusseau et al. [41]. A link between two dolphins was established by observation of statistically significant frequent association. The network comprises two communities, as shown in Figure 4.
The network of political books (Books) was compiled by Krebs [42]. The nodes represent 105 books on American politics bought from https://Amazon.com. A total of 441 links join pairs of books frequently purchased by the same buyer. The network is composed of three communities, as shown in Figure 5.
The network of American football games (Football) between Division IA colleges during regular season Fall 2000 was compiled by Girvan and Newman [25].The network is composed of 11 conferences plus a few other teams without a clear affiliation as shown in Figure 6.

4.3. Experiment Results. In our experiments, we ignore any quantitative definition of community and take the partition at which $Q$ reaches its maximum value. This lets CD obtain better results, at the cost of efficiency. Besides, to avoid the local adjustment process of the distance dissimilarity algorithm, DD removes the links that have the highest dissimilarity value at each iteration. CD3 and CD4 denote the edge-clustering coefficient (CD) algorithm of orders 3 and 4, respectively.

4.3.1. Results on Artificial Networks. Figure 7 shows the results of the algorithms on the LFR$_\mu$ data sets. The NMI values obtained by AD are about 0.15 lower than the average of the other algorithms. The $Q$ values obtained by AD are close to the average of the other algorithms. Figure 8 shows the results of the algorithms on the LFR$_k$ data sets. When $k$ is low, the NMI and $Q$ values obtained by AD are lower than those obtained by the other algorithms. However, when $k$ increases, the NMI and $Q$ values obtained by AD rise sharply, which means AD is more effective in discovering community structures when the communities are cohesively connected with each other.
From Figures 7 and 8, it seems that AD does not perform better than most of the other algorithms. However, all the other algorithms are guided by modularity $Q$, as mentioned in the first paragraph of Section 4.3, which means that Figures 7 and 8 show the best performance of the other algorithms. AD is not guided by modularity $Q$ or by any parameter, which means that Figures 7 and 8 show the average performance of AD. Thus, we cannot say that AD performs worse than the other algorithms.
From Figures 7 and 8, we can observe that the $I$ values obtained by AD are lower than those obtained by the other algorithms, which means that the link-break strategy of AD can reduce the number of iterations of a divisive algorithm, thus improving its efficiency. Besides, we can observe that IC has the highest time cost, which agrees with the analysis in Table 1. Based on the measured time costs, we arrange the algorithms in ascending order of time cost: CD3 < CD4 < DD < CW < AD < GN < IC.

4.3.2. Results on Real-World Networks. From Table 3, we can observe that, for Karate, AD gets the highest NMI value. Besides, AD also gets a higher $Q$ value than DD, CD4, and CW. For the I-measure, AD gets the lowest $I$ value.
From Table 4, we can observe that, for Dolphins, AD gets a higher NMI value than GN, IC, CD3, CD4, and CW. Besides, AD gets a higher $Q$ value than DD. For the I-measure, AD gets the lowest $I$ value. From the NMI and $Q$ values in Table 4, we can also observe that NMI and $Q$ are independent of each other. NMI is used to evaluate the quality of a partition when the real community structure is known, while $Q$ is used to evaluate the quality of a partition when the real community structure is unknown.
From Table 5, we can observe that, for Books, AD gets a higher NMI value than GN, CD3, and CW. Besides, AD gets a higher $Q$ value than DD, IC, CD4, and CW. For the I-measure, AD gets the lowest $I$ value.
From Table 6, we can observe that, for Football, AD gets the lowest NMI and $Q$ values. There are two reasons for the poor NMI and $Q$ results. First, there are a few teams without a clear affiliation. As shown in Figure 6, among the teams of the conference "Independents," only teams 81 and 83 are connected to each other. Second, some teams are more tightly connected with teams from other conferences than with teams from their own conference. For example, all the teams of "Sun Belt" have more connections to teams outside the conference than to teams inside it. For the I-measure, AD gets the lowest $I$ value.
From Tables 3, 4, 5, and 6, we can observe that AD performs better in identifying communities in real-world networks than in artificial networks. There are two reasons for this phenomenon. First, AD is based on the differences between the properties of internal links and external links in real-world networks, where internal and external links exhibit different characteristics. Second, the LFR benchmark simulates some features of real-world networks (the node degree and community size follow power-law distributions); however, it does not consider the differences between the properties of internal links and external links. Therefore, we have reason to believe that AD performs better in identifying communities in real-world networks than in artificial networks.

Conclusions
In this paper, we proposed a new divisive algorithm to overcome the limitations on parameters, nontopological information, division efficiency, and community definitions.
To make our algorithm free from parameters and nontopological information, we proposed the weak link, which helps detect the links connecting different communities. To improve division efficiency, we proposed a link-break strategy based on the weak link, so that our algorithm could remove multiple links at each iteration. To overcome the limitation on community definition, we introduced an autonomous division in our algorithm to end the algorithm without the help of community definitions. Empirical evaluations on artificial and real-world networks showed that the proposed algorithm achieves a better accuracy-efficiency trade-off than some of the latest divisive algorithms.

[…] 61402126), (2) Heilongjiang Province Natural Science Foundation (no. F2015030), (3) Province in Heilongjiang Outstanding Youth Science Fund (no. QC2016083), and (4) Heilongjiang Postdoctoral Fund to Pursue Scientific Research in Heilongjiang Province (no. LBH-Z14071).

Figure 1: Observations of the shortest path tree.

Table 1: Time complexity of divisive algorithms for community detection.

Table 2: The parameters of artificial networks.

Table 3: The results on Karate.

Table 4: The results on Dolphins.

Table 5: The results on Books.

Table 6: The results on Football.