Community Detection by Node Betweenness and Similarity in Complex Network

Community detection in complex networks has long been an active research topic. As the mixing parameter μ increases, network structure becomes more complex, and community detection algorithms need to be improved. Building on previous work, this paper designs a novel algorithm from the perspective of node betweenness and gives the detailed steps of the algorithm together with simulation results. We compare the proposed algorithm with a series of typical algorithms through experiments on synthetic and real networks. The experimental results demonstrate the effectiveness and superiority of our algorithm.


Introduction
Complex networks have contributed significantly to the understanding of complex systems [1]. At present, the application fields of complex networks include biology [2,3], epidemiology [4], sociology [5,6], transportation [7], etc. Through the study of these real networks, we can better understand the characteristics of the network and discover its properties [8]. A network is composed of nodes and edges, which can represent various individuals and relationships in the real world. For example, the degree of a node can measure its importance [9,10], and another important aspect is the community structure of the network [11]. In most cases, nodes in the same community have similar attributes or closer links. Community structure is of great significance for disease prevention, information dissemination, and product recommendation. However, existing community discovery methods cannot fully meet the needs of current applications [12].
In recent years, many methods have been proposed for community detection [13], including the label propagation algorithm (LPA), edge deletion methods, and spectral clustering. However, these algorithms are limited to a certain extent. For example, the LPA runs fast, but its results are unstable. The GN algorithm is relatively accurate, but its running time is long, especially on large networks. Spectral clustering finds communities well in simple networks, but it places high requirements on the similarity matrix, and different similarity matrices may give different results. The EM algorithm runs quickly, but its accuracy decreases as the complexity of the network increases. Therefore, detecting the community structure of networks accurately remains an important problem. Based on these observations, we propose a method that combines node betweenness and network structural similarity to discover communities in complex networks. The proposed method is divided into three parts: identifying the central nodes, expanding the communities, and integrating the communities. Firstly, node betweenness is used to determine the influence of each node in the network, and the initial number of communities is determined according to the average betweenness of the nodes. Secondly, the remaining nodes are assigned, by structural similarity, to the community of the corresponding central node. Finally, according to the number of nodes they contain, we divide the initial communities into large communities and small communities. Then, we integrate the small communities into the large communities according to the increase in modularity.
The specific process is shown in Figure 1.

Basic Principle.
A community is a structure that exists in a complex network. Within a community, the links between nodes are dense, while the links between communities are relatively sparse. Nodes influence each other in a complex network, but there is usually one central node in a community, which is conducive to the management and operation of the community [12]. Based on these ideas, we use the betweenness of nodes [14] to find the central node of each community. The remaining nodes then need to be assigned to the community of a central node. We use structural similarity to assign them, which yields the initial community structure. The final step is the integration of communities. In this paper, the initial communities are classified into large communities and small communities according to the number of nodes they contain [15]. If the number of nodes in a community is greater than a certain threshold, it is called a large community; otherwise, it is called a small community. This paper uses modularity to integrate the small communities.

Contributions.
Many classic community detection algorithms have been proposed. They are suitable for different types of networks and often require parameters to be set. Compared with existing community detection algorithms, the algorithm in this paper makes the following contributions: (1) We use node betweenness to determine the number of communities and the initial node of each community. (2) The expansion of the communities is achieved by comparing the structural similarity between the remaining nodes and the nodes in the communities. (3) We divide the initial communities into large communities and small communities according to the number of nodes, and modularity is used to integrate the small communities into the large ones. The structure of the paper is as follows. Section 3 introduces the proposed algorithm in detail, including some basic concepts and the detailed steps of the algorithm. Section 4 evaluates the detected communities using normalized mutual information and modularity, on artificial networks and several real networks. Section 5 further summarizes and discusses the proposed algorithm.

Related Work
In the past few decades, scholars have proposed many community detection methods for complex networks, such as the label propagation algorithm [16], methods based on modularity optimization [17], spectral clustering [18], random walk methods [19], and divisive hierarchical clustering [20]. The label propagation algorithm was proposed by Raghavan et al. [16] in 2007. It uses the label information of known nodes to predict the label information of unknown nodes. At the beginning, each node is initialized with its own label, which represents its community. In each iteration, the label of each node is updated based on the labels of its neighbors. Nodes with the same label belong to the same community, and nodes with different labels belong to different communities. The running time of the label propagation algorithm is linear, but its results are unstable.
Newman [17] proposed modularity in 2004; it measures the difference between the actual number of edges within communities and its expected value in a random network. The higher the modularity, the better the detected community structure, and many scholars use modularity as an objective function for community detection. Modularity optimization is an NP-hard problem, and many scholars have tried to reduce its difficulty by various methods [21,22]. Building on [17], Newman [23] redefined modularity in 2006 and adopted a spectral method to optimize the objective function. However, for existing modularity optimization methods, the quality of community discovery degrades as the complexity of the network grows.
McSherry [18] proposed a method based on graph partitioning. Considering the low-rank structure of the expected adjacency matrix, spectral clustering is used to solve the problems of graph bisection, graph coloring, and clique finding. Coja-Oghlan and Lanka [24] use spectral clustering to solve the problem of community detection in sparse networks where the number of nodes increases but the node degree is bounded, and in networks where the node degree obeys a power-law distribution. Bshouty and Gentile [25] propose a spectral clustering algorithm to solve the community detection problem of the popular planted partition model. Rinaldo and Lei [26] proposed a method that is simple, easy to implement, and theoretically well founded, so it is one of the most commonly used community detection methods. However, classic spectral clustering is not ideal when dealing with sparse networks. Spectral clustering relies on the similarity matrix, and different similarity matrices lead to different final results.
The concept of the random walk first appeared in Pearson [19], who described it, and Rayleigh extended it further. Random walks have been widely used in many research fields and have been successfully introduced to community detection. The basic principle is that, because connections within a community are dense, a random walker usually stays within a community for a long time. Therefore, a random walker starting from a node will, after a few steps, arrive at a node in the same community. Methods based on random walks are easily extended to community detection in weighted networks, which is one of their advantages. Rosvall and Bergstrom [27] proposed the Infomap method based on information theory. The algorithm uses random walkers as a proxy for information flow in the network. The performance of Infomap is extremely good, and it has been recognized and used by many scholars. However, random walk algorithms are stochastic and very sensitive to the selection of the initial nodes, which has a great impact on community detection.
For hierarchical clustering algorithms, Girvan and Newman [28] calculate the betweenness of each edge in the network graph. However, the above algorithms still have problems such as long detection time and unstable detection results. Therefore, this paper uses node betweenness [14] to detect the central nodes of the network and expands the remaining nodes into the communities of the central nodes through structural similarity.

Community Detection Based on Node Betweenness
3.1. Preliminaries. We consider unweighted, undirected networks in this paper. Let $G = (V, E)$ be a graph, where $V = \{v_1, v_2, \ldots, v_n\}$ represents the nodes, $E = \{e_1, e_2, \ldots, e_m\}$ represents the edges, and $n$ and $m$ represent the number of nodes and edges in the network, respectively.
Definition 1. Node betweenness: it reflects the influence of a node in the global network. The higher the betweenness of a node, the more shortest paths between node pairs pass through it, and the more likely it is to be at the center of a community. For a node $v$ in the network, the node betweenness is calculated as follows [14]:
$$C_B(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}, \tag{1}$$
where $\sigma_{st}(v)$ represents the number of shortest paths from $s$ to $t$ that pass through node $v$ and $\sigma_{st}$ represents the total number of shortest paths from $s$ to $t$. Because the graph is unweighted, every edge is assigned a weight of 1 when the betweenness is calculated.
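The betweenness of Definition 1 can be computed with the accumulation scheme of Brandes [14]. The following is a minimal pure-Python sketch under stated assumptions, not the authors' implementation: the adjacency-dictionary input format and the name `node_betweenness` are hypothetical, and each unordered pair of endpoints is counted once.

```python
from collections import deque


def node_betweenness(adj):
    """Betweenness of every node in an unweighted, undirected graph.

    adj: dict mapping each node to the set of its neighbours (assumed format).
    Returns unnormalised values; each unordered pair {s, t} is counted once.
    """
    cb = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s: count shortest paths (sigma)
        # and remember the predecessors on those paths.
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        preds = {v: [] for v in adj}
        sigma[s], dist[s] = 1, 0
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies, farthest nodes first.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                cb[w] += delta[w]
    # Every unordered pair was visited from both of its endpoints.
    return {v: c / 2 for v, c in cb.items()}


# Path graph a - b - c: only b lies between another pair of nodes.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
betweenness = node_betweenness(path)
```

In the path example only the middle node receives a nonzero value, which matches the intuition that high-betweenness nodes sit between other nodes.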

Definition 2. Structural similarity: the structural similarity of two nodes reflects how similar they are in terms of the network structure [12]:
$$s(x, y) = \frac{2\,|N(x) \cap N(y)|}{|N(x)| + |N(y)|}, \tag{2}$$
where $x$ and $y$ are nodes in the graph $G$, and $N(x)$ and $N(y)$ represent the sets of direct neighbors of node $x$ and node $y$, respectively.
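As an illustration of a neighbor-based structural similarity, the sketch below uses the Sørensen index, which the similarity comparison later in the paper reports as the adopted measure. The function name `sorensen_similarity` and the adjacency-dictionary format are assumptions made for this example.

```python
def sorensen_similarity(adj, x, y):
    """Sorensen index of two nodes, based on their direct neighbours.

    adj: dict mapping each node to the set of its neighbours (assumed format).
    """
    total = len(adj[x]) + len(adj[y])
    if total == 0:
        return 0.0  # two isolated nodes share no structure
    return 2 * len(adj[x] & adj[y]) / total


# Small example: a triangle 1-2-3 with a pendant node 4 attached to node 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
sim_12 = sorensen_similarity(adj, 1, 2)  # one common neighbour (node 3)
sim_14 = sorensen_similarity(adj, 1, 4)  # one common neighbour (node 3)
```

Node 4 has a single neighbor, so sharing that neighbor with node 1 weighs more heavily than the one shared neighbor of nodes 1 and 2.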
Definition 3. Average node betweenness: it reflects the average level of the node betweenness and is used to determine the number of initial central nodes. The average betweenness is calculated as follows:
$$\text{AverageBetweenness} = \frac{1}{n} \sum_{i=1}^{n} C_B(v_i), \tag{3}$$
where $n$ represents the number of nodes in the network graph and $C_B(v_i)$ represents the betweenness of node $v_i$. The average betweenness is used to find large communities and small communities among the initial community results of the network.
[Figure 1: Flowchart of the proposed algorithm (Start, Step 1, Step 2, Step 3).]
Definition 4. Modularity: when the number of network communities is unknown, modularity is an indicator of the quality of the detected communities, and it is calculated as follows [12]:
$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{d(v_i)\, d(v_j)}{2m} \right] \delta(C_i, C_j), \tag{4}$$
where $d(v_i)$ represents the degree of node $v_i$ and $A_{ij} = a_{ij}$ is an entry of the adjacency matrix of the graph: $a_{ij} = 1$ means that node $v_i$ and node $v_j$ are connected, and $a_{ij} = 0$ means that they are not. $\delta(C_i, C_j)$ is an indicator function that equals 1 when $v_i$ and $v_j$ belong to the same community and 0 otherwise, and $m$ represents the number of edges in the network. Modularity can be used to assess the quality of community detection at any stage of the detection process.
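The modularity of Definition 4 can be evaluated directly from its double sum over node pairs. The sketch below is a plain transcription of that sum for small graphs; the name `modularity` and the input formats (an adjacency dictionary and a list of node sets) are assumptions, and the quadratic loop is chosen for readability rather than speed.

```python
def modularity(adj, communities):
    """Modularity Q of a partition of an unweighted, undirected graph.

    adj: dict mapping each node to the set of its neighbours (assumed format).
    communities: list of disjoint node sets covering all nodes.
    """
    m = sum(len(nbrs) for nbrs in adj.values()) / 2  # number of edges
    label = {v: idx for idx, comm in enumerate(communities) for v in comm}
    q = 0.0
    for i in adj:
        for j in adj:
            if label[i] != label[j]:
                continue  # the indicator function is 0 for this pair
            a_ij = 1 if j in adj[i] else 0
            q += a_ij - len(adj[i]) * len(adj[j]) / (2 * m)
    return q / (2 * m)


# Two triangles joined by the single edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
q_split = modularity(adj, [{0, 1, 2}, {3, 4, 5}])  # natural two-community split
q_single = modularity(adj, [set(adj)])             # everything in one community
```

Placing all nodes in one community gives a modularity of zero, so any positive value indicates more internal edges than expected at random.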

Proposed Algorithm.
The algorithm consists of three steps: identifying the central nodes, expanding the communities, and integrating the communities.
In Step 1, we use node betweenness to search for the central nodes in the network graph. For the nodes $V = \{v_1, v_2, \ldots, v_n\}$, we calculate the betweenness of each node by formula (1), and the nodes are sorted in decreasing order of betweenness. AverageBetweenness is the average betweenness, and $C_{int}$ is the set of central nodes. For the node set $\{v_{j1}, v_{j2}, \ldots, v_{jn}\}$ sorted by node betweenness, if $C_B(v_{ji}) > \text{AverageBetweenness}$, then $v_{ji}$ is added to $C_{int}$. The pseudocode is shown in Algorithm 1.
In Step 2, we assume that the similarity among nodes in the same community is larger than that among nodes in different communities. Similarity ensures that the core community consists of nodes with high similarity and dense connections [12]. The expansion of the community maximizes structural similarity. The choice of structural similarity is discussed in detail in Section 5.1. The expansion of the community is given by the following formula [12]:
$$C(v_i) = C\Big(\arg\max_{v_j \in C_{int}} s(v_i, v_j)\Big), \tag{5}$$
where $v_i$ and $v_j$ represent nodes in the network, $s(v_i, v_j)$ is the structural similarity of Definition 2, $C(v)$ denotes the community of node $v$, and $C_{int}$ is the set of central nodes of all communities.
In Step 3, for the initially formed communities $C = \{C_1, C_2, \ldots, C_k\}$, the following cases may occur. First, an initial community may contain only one node, which must be the central node of that community. The reason is that the similarity between this central node and the remaining nodes is less than the similarity between other central nodes and the remaining nodes, so none of the remaining nodes joined its community. Second, after the single-node communities are integrated, some communities may still contain few nodes. In a large network, such small communities exist in large numbers, which seriously affects the quality of community discovery. It is therefore necessary to further integrate small communities.
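Steps 1 and 2 can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: the function names are hypothetical, betweenness values are passed in precomputed, similarity is taken to be the Sørensen index, each remaining node is compared with the central nodes only, and ties go to the central node with the higher betweenness rank.

```python
def sorensen(adj, x, y):
    """Sorensen similarity of two nodes from their neighbour sets."""
    total = len(adj[x]) + len(adj[y])
    return 2 * len(adj[x] & adj[y]) / total if total else 0.0


def select_central_nodes(betweenness):
    """Step 1: nodes whose betweenness exceeds the average, highest first."""
    average = sum(betweenness.values()) / len(betweenness)
    ranked = sorted(betweenness, key=betweenness.get, reverse=True)
    return [v for v in ranked if betweenness[v] > average]


def expand_communities(adj, centers):
    """Step 2: attach every remaining node to its most similar central node."""
    communities = {c: {c} for c in centers}
    for v in adj:
        if v in communities:
            continue  # central nodes already seed their own community
        best = max(centers, key=lambda c: sorensen(adj, v, c))
        communities[best].add(v)
    return communities


# Two 4-cliques {0,1,2,3} and {4,5,6,7} joined by the bridge 3-4.
adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4},
       4: {3, 5, 6, 7}, 5: {4, 6, 7}, 6: {4, 5, 7}, 7: {4, 5, 6}}
# Betweenness worked out by hand for this graph: only the bridge ends
# lie on shortest paths between other nodes (12 pairs each).
betweenness = {0: 0.0, 1: 0.0, 2: 0.0, 3: 12.0,
               4: 12.0, 5: 0.0, 6: 0.0, 7: 0.0}
centers = select_central_nodes(betweenness)
initial = expand_communities(adj, centers)
```

On this toy graph the two bridge endpoints become the central nodes and each clique is recovered as one initial community. A graph in which no node exceeds the average (for example, a cycle) yields no central nodes, a case this sketch does not handle.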
Hu et al. [2] adopted an evaluation method based on the number of nodes in a community. They took the average number of nodes per community as the threshold for small communities, as follows:
$$S = \frac{1}{k} \sum_{i=1}^{k} |C_i|, \tag{6}$$
where $|C_i|$ represents the number of nodes in initial community $C_i$, $k$ represents the number of initial communities, and $S$ is the average number of nodes per initial community. First, we search for the communities with fewer than $S$ nodes and integrate them. Modularity is introduced as an evaluation index to measure the quality of the network community structure and to merge communities.
The higher the value of modularity $Q$, the better the community quality. Therefore, it is necessary to find the assignment of nodes to communities that maximizes $Q$. In the third stage, when the increase in modularity from merging two communities satisfies (7), the two communities $C_i$ and $C_j$ are merged into one community $C_i \cup C_j$, where $C_i, C_j \in C$:
$$\Delta Q = Q_{C_i \cup C_j} - Q_{C_i, C_j} > 0, \tag{7}$$
where $Q_{C_i \cup C_j}$ is the modularity after the merge and $Q_{C_i, C_j}$ is the modularity before it. Among all candidate pairs, the merge that gives the largest increase in $Q$ is chosen. This process is repeated until no small communities remain.
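The integration step can be sketched as a greedy loop over the small communities. The code below is a hedged illustration rather than the authors' procedure: the function names are hypothetical, a community counts as small when it has fewer nodes than the average size S, small communities are processed in an arbitrary order, and a small community with no positive modularity gain is left unchanged (the text does not specify this case).

```python
def modularity(adj, communities):
    """Modularity Q, grouped per community (same value as the pairwise sum)."""
    two_m = sum(len(nbrs) for nbrs in adj.values())
    q = 0.0
    for comm in communities:
        # Each internal edge is seen from both of its endpoints.
        internal = sum(1 for v in comm for w in adj[v] if w in comm)
        degree = sum(len(adj[v]) for v in comm)
        q += internal / two_m - (degree / two_m) ** 2
    return q


def merge_small_communities(adj, communities):
    """Merge each small community into the large one with the best gain."""
    comms = [set(c) for c in communities]
    s_avg = sum(len(c) for c in comms) / len(comms)  # threshold S
    large = [c for c in comms if len(c) >= s_avg]
    pending = [c for c in comms if len(c) < s_avg]
    kept = []  # small communities that no merge improves
    while pending:
        small = pending.pop()
        q_before = modularity(adj, large + pending + kept + [small])
        best_gain, best_idx = 0.0, None
        for idx, big in enumerate(large):
            others = [c for k, c in enumerate(large) if k != idx]
            trial = others + pending + kept + [big | small]
            gain = modularity(adj, trial) - q_before
            if gain > best_gain:
                best_gain, best_idx = gain, idx
        if best_idx is None:
            kept.append(small)
        else:
            large[best_idx] = large[best_idx] | small
    return large + kept


# Two 4-cliques joined by the bridge 3-4; node 3 starts as its own community.
adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4},
       4: {3, 5, 6, 7}, 5: {4, 6, 7}, 6: {4, 5, 7}, 7: {4, 5, 6}}
merged = merge_small_communities(adj, [{0, 1, 2}, {3}, {4, 5, 6, 7}])
```

In the example, joining node 3 to its own clique raises the modularity while joining it to the other clique lowers it, so only the first merge is accepted.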

The Proposed Algorithm and Its Complexity.
The community detection algorithm consists of three parts: identifying the central nodes, expanding the communities, and integrating the communities. In the first stage, it generally takes $O(n^2)$ time to sort the nodes by betweenness and $O(mn + n \log n)$ time to calculate the node betweenness in an undirected graph, where $m$ is the number of edges and $n$ is the number of nodes. In the second stage, expanding the communities, computing the similarity between neighbors costs $O(k^2)$, where $k$ is the average degree of the network. In the last stage, integrating the communities starts from the $k$ big communities and merges them into $t$ communities ($t \le k \le n$). Comparing the modularity over the combinations of the $k$ big communities takes $O(k^2)$.
In summary, the complexity of the algorithm is $O(n^2 + mn + n \log n + k^2) \approx O(n^2)$. Among classic algorithms, the LPA has time complexity $O(m)$, which is lower than that of the algorithm in this paper, although the two are almost the same when the number of nodes is small. However, the GN algorithm costs $O(m^2 n)$ and the EM algorithm costs $O(MN^3)$, where $M$ is the number of iterations and $N$ is the number of parameters, so the time complexity of the algorithm in this paper is lower than that of these methods.

Example.
The process of the algorithm is further explained using the Football network, which is shown in Figure 2.
First of all, the central nodes need to be selected in the network graph, and node betweenness is used to select them. The central nodes selected by node betweenness are as follows: . We calculate the similarity between each remaining node and each central node and then assign the node to the community of the central node with the greatest similarity. The resulting initial communities are shown in Table 1.
From the initial communities, it can be observed that some communities consist of one node and others of multiple nodes. The initial communities are divided into large communities and small communities according to the average number of nodes per community. If the number of nodes in an initial community is greater than the average, it is defined as a large community; otherwise, it is defined as a small community. Small communities consist of one or more nodes, and their nodes are assigned to the large communities according to modularity. Finally, the number of communities is determined and the small communities are integrated. The division into small communities and large communities is shown in Tables 2 and 3.
In the process of integrating small communities into large communities, we calculate the change in modularity for each pair of a small community and a large community. If the modularity increases, the small community is merged into the large community; if several modularity increments are greater than 0, the large community with the largest increment is selected. For example, community $C_1$ is integrated with each of the large communities in turn to calculate the increase in modularity, as shown in Table 4.
From Table 4, when $C_1$ is integrated with $C_{12}$, $C_{14}$, $C_{17}$, and $C_{20}$, respectively, there are two cases in which the modularity increment is greater than 0. In this case, the integration with the largest modularity increment is selected ($C_1$ and $C_{20}$). The final division of the communities is shown in Table 5, and the visualization result is shown in Figure 3.

Experimental Results
Through experiments, we compare the performance of various algorithms on synthetic networks and real networks. The algorithm in this paper is implemented in Python with NetworkX, using PyCharm.

Evaluation Metrics.
To compare the quality of different community detection algorithms, we used NMI and ARI to evaluate the results.

Normalized Mutual Information (NMI).
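NMI is a measure of the similarity between two partitions. It is calculated as follows:
$$\mathrm{NMI}(X, Y) = \frac{-2 \sum_{i=1}^{C_X} \sum_{j=1}^{C_Y} N_{ij} \log\left(\frac{N_{ij} N}{N_{i\cdot} N_{\cdot j}}\right)}{\sum_{i=1}^{C_X} N_{i\cdot} \log\left(\frac{N_{i\cdot}}{N}\right) + \sum_{j=1}^{C_Y} N_{\cdot j} \log\left(\frac{N_{\cdot j}}{N}\right)}, \tag{8}$$
where $X$ represents the real partition, $Y$ is the partition found by the algorithm, $C_X$ is the number of actual communities, $C_Y$ is the number of communities found, $N$ is the number of nodes in the network, $N_{ij}$ is the number of nodes shared by the real community $i$ in partition $X$ and the found community $j$ in partition $Y$, $N_{i\cdot}$ represents the sum of the $i$-th row of the matrix $N_{ij}$, and $N_{\cdot j}$ represents the sum of the $j$-th column of the matrix $N_{ij}$.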
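A short sketch of the NMI computation from two label lists may make the roles of the confusion-matrix entries concrete. It assumes the commonly used form of NMI built from the confusion matrix, with row sums and column sums in the denominator; the function name `nmi` and the label-list input format are assumptions for illustration.

```python
from collections import Counter
from math import log


def nmi(true_labels, found_labels):
    """Normalized mutual information of two partitions given as label lists."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, found_labels))  # confusion matrix N_ij
    rows = Counter(true_labels)                      # row sums
    cols = Counter(found_labels)                     # column sums
    numerator = -2.0 * sum(
        n_ij * log(n_ij * n / (rows[i] * cols[j]))
        for (i, j), n_ij in joint.items()
    )
    denominator = (sum(c * log(c / n) for c in rows.values())
                   + sum(c * log(c / n) for c in cols.values()))
    if denominator == 0:
        return 1.0  # both partitions put every node in a single community
    return numerator / denominator


# Same grouping under different label names, then an unrelated grouping.
same = nmi([0, 0, 1, 1], [1, 1, 0, 0])
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

Only nonzero entries of the confusion matrix are visited, which avoids taking the logarithm of zero.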

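Adjusted Rand Index (ARI).
ARI is another indicator that compares two partitions and is defined as follows:
$$\mathrm{ARI} = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] / \binom{n}{2}}{\frac{1}{2}\left[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\right] - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] / \binom{n}{2}}, \tag{9}$$
where $n_{ij}$, $a_i$, and $b_j$ are the entries, row sums, and column sums of the contingency table, and $n$ is the total number of nodes. ARI evolved from the Rand Index (RI), and its value range is [−1, 1]. The various values of ARI are described in detail in the literature.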
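The ARI can likewise be computed from the contingency table of two label lists. The sketch below assumes the standard pair-counting definition; the function name `adjusted_rand_index` and the input format are assumptions for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, found_labels):
    """Adjusted Rand Index of two partitions given as label lists."""
    n = len(true_labels)
    if n < 2:
        return 1.0  # no node pairs to disagree on
    sum_ij = sum(comb(c, 2)
                 for c in Counter(zip(true_labels, found_labels)).values())
    sum_a = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_b = sum(comb(c, 2) for c in Counter(found_labels).values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are trivial
    return (sum_ij - expected) / (max_index - expected)


# Same grouping under different label names, then a crossed grouping.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The crossed grouping scores below zero, which illustrates that ARI can be negative when two partitions agree less than expected by chance.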

Results and Discussion of Synthetic Networks.
In the experiment, we generated several synthetic networks with different characteristics to compare the performance of community detection algorithms. To keep the synthetic networks consistent with real networks, we use the LFR benchmark, whose network characteristics are easy to control. Because the node degree distribution and the community size distribution are controlled, LFR networks exhibit properties similar to those of real networks. In the LFR benchmark, different networks can be generated by changing the parameters. One parameter that needs to be emphasized is the mixing parameter μ: each node shares a fraction 1 − μ of its links with other nodes in its own community and a fraction μ with nodes in the rest of the network. Generally speaking, the LFR network has five structural parameters: the number of network nodes n, the average degree k, the maximum and minimum community sizes maxc and minc, and the mixing parameter μ. In other words, μ is the proportion of a node's links that connect it to nodes in other communities.
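To make the meaning of the mixing parameter concrete, the sketch below measures, for each node of a labelled graph, the fraction of its links that lead outside its own community. This is an illustrative check on a toy graph, not part of the LFR generator; the function name `node_mixing` and the input formats are assumptions.

```python
def node_mixing(adj, label):
    """Fraction of each node's links that lead to other communities.

    adj: dict mapping each node to the set of its neighbours (assumed format).
    label: dict mapping each node to its community label.
    """
    return {
        v: sum(label[w] != label[v] for w in nbrs) / len(nbrs)
        for v, nbrs in adj.items()
        if nbrs  # isolated nodes have no links to classify
    }


# Two 4-cliques joined by the bridge 3-4, labelled by clique.
adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4},
       4: {3, 5, 6, 7}, 5: {4, 6, 7}, 6: {4, 5, 7}, 7: {4, 5, 6}}
label = {v: 0 if v < 4 else 1 for v in adj}
mixing = node_mixing(adj, label)
average_mixing = sum(mixing.values()) / len(mixing)
```

Only the two bridge endpoints have links leaving their community, so the average stays close to zero; networks generated with a larger μ would show larger per-node fractions.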
To verify the proposed algorithm, the mixing parameter μ and the number of nodes of the network are varied. Firstly, the influence of the mixing parameter on the various algorithms is examined. The number of nodes is fixed at 1000, the community size is between 10 and 20, the average degree is k = 5, and the mixing parameter is varied between 0.1 and 0.8. The results are shown in Figure 4: Figure 4(a) shows the NMI, and Figure 4(b) shows the ARI. As shown in Figure 4(a), as the mixing parameter μ increases, the NMI value of each algorithm decreases, because the structure of the artificially generated network (LFR) becomes more complicated and the community structure becomes fuzzy.
Figure 4 shows how the normalized mutual information (NMI) of the various algorithms changes with the mixing parameter μ. It can be seen from Figure 4 that as the value of μ increases, the ability of the various algorithms to discover the community structure of the network decreases. As the mixing parameter becomes larger, the community structure of the artificial network becomes more complex.

Algorithm 1
Input: graph G, node set V, link set E.
Big community: the set of large-scale communities. Small community: the set of small-scale communities. $C_{int}$: the set of nodes whose betweenness is over the average betweenness.
Output: the detected communities C.
Step 1: identifying the influential nodes
  (1) Rank the nodes by node betweenness in decreasing order.
Step 2: expanding the community
  (1) Calculate the similarity between the remaining nodes and $C_{int}$, and attribute each remaining node to the community where the node with the highest similarity is located.
  (2) The initial communities $C = \{C_1, C_2, \ldots, C_k\}$ are formed.
Step 3: integrating the community
  (a) Let S be the average number of nodes per community.
      For i in k:
        If $|C_i| < S$, then the community $C_i$ is a small-scale community;
      Small community = $\{C_1, C_2, \ldots, C_n\}$ ($n \le k$)
  (b) for $C_i$ in Big community:
        for $C_j$ in Small community:
In Figure 4(a), when 0.1 < μ < 0.3, the NMI values of the proposed algorithm and Louvain are larger than those of the other algorithms. The NMI values of the GN algorithm, the LPA, and spectral clustering are lower, and that of EM is the worst. When 0.3 < μ < 0.5, Walktrap has the largest NMI, and the LPA and our method are almost the same, while the NMI of GN and spectral clustering is better than that of EM. When μ > 0.5, the NMI of our algorithm is better than that of the other six algorithms. This shows that the proposed method performs well as μ increases. Figure 4(b) shows the Adjusted Rand Index (ARI) of each algorithm as the mixing parameter μ increases. It can be seen from Figure 4(b) that when 0.1 < μ < 0.3, Louvain performs better than the other methods. The ARI of the proposed algorithm is almost the same as that of the GN algorithm, and the ARI values of the other methods are lower than those of the proposed algorithm and GN. When 0.3 < μ < 0.5, Walktrap gives better results than the other methods. The ARI values of the proposed method and GN fluctuate, but the proposed method is higher for most of this range. When μ > 0.5, the ARI of the proposed method is larger than that of the other algorithms.
In summary, when μ > 0.5, both the NMI and the ARI of the proposed algorithm are the largest, so the proposed method outperforms the other methods in this range.
In Figure 5, the mixing parameter of the network is fixed at μ = 0.5, and the community size is between min = 10 and max = 20. As the number of nodes increases, the trends of the NMI and ARI of the various algorithms are shown in Figures 5(a) and 5(b), respectively.
In Figure 5(a), as the number of nodes increases from 1000 to 9000, the NMI of the proposed method and of the LPA keeps rising. When 1000 < n < 3000, the NMI of the proposed algorithm is the best, although the NMI of the LPA is almost the same. Louvain and Walktrap show the same trend, and the NMI of EM is the worst. When 3000 < n < 6000, the NMI of the LPA is the best, and the proposed algorithm and the GN algorithm are almost the same. The NMI of Walktrap tends to decrease, while Louvain keeps its previous trend. When n > 6000, the NMI of our method is the best, and the other methods keep their previous trends.

Small communities (fragment).
Community    Node's index of community
...          51, 15
$C_{10}$     57
$C_{11}$     39, 48
$C_{13}$     29
$C_{15}$     14
$C_{16}$     33, 44, 99
$C_{18}$     38, 50
$C_{19}$     23, 59
$C_{21}$     63, 97

Table 4 (fragment).
Integration communities    Increase in modularity
$C_1$ and $C_{12}$         −0.0003
In Figure 5(b), it can be seen that the proposed algorithm and GN have clear advantages. As the number of nodes increases, when 1000 < n < 5000, Louvain has the best ARI, and the ARI of the proposed algorithm is larger than that of the GN algorithm. When 5000 < n < 7000, the ARI of the proposed method is the best, and the ARI values of the other algorithms behave similarly to one another. When n > 7000, all six methods have almost the same performance.
As the number of nodes increases, the NMI of the proposed method is better than that of the other methods.

Results and Discussion of Real-World Network.
We also evaluated the algorithm on some typical real networks. The network data are available at "http://www.personal.umich.edu/~mejn/netdata/" and "https://github.com/zzz24512653/CommunityDetection/tree/master/network". The details of these networks are shown in Table 6, where |V| represents the number of nodes, |E| the number of edges, k the average degree, $\beta_{th}$ the epidemic threshold, AC the average clustering coefficient, Assortative the assortativity coefficient, and DH the degree heterogeneity. The communities found by the proposed algorithm are shown in Tables 7-10. The modularity on the real networks is shown in Table 11 and Figure 6.
The Karate network is a well-known real social network that reflects the conflict between the club management and the coach through the friendships between club members. The method strictly divides the club into two communities, and most nodes are assigned correctly. The modularity of the proposed algorithm and of EM is higher, while the modularity of the other algorithms is relatively low. The community detection results are shown in Figure 7.
The nodes of the Dolphin network represent dolphins, and the edges represent the relationships between dolphins. In terms of modularity, the proposed algorithm and spectral clustering (SC) score higher, whereas the modularity of the LPA, EM, and GN is much smaller. The community structure obtained by the proposed algorithm is shown in Figure 8.
The Football network comes from the American football games of the fall 2000 season [28]. The nodes represent teams, and the edges represent games between teams. In this network, the modularity of the proposed method and of spectral clustering is high, while the modularity of the remaining methods is low. The community structure is shown in Figure 9.
The Polbooks network is a network of American political books. The nodes represent the books, and an edge connects two books purchased by the same buyer. In this network, the modularity of the proposed method is lower than that of spectral clustering, and the other methods have lower modularity still. The community structure is shown in Figure 10.

Discussion and Conclusions
The proposed algorithm is composed of three steps: identifying the central nodes, expanding the communities, and integrating the initial communities. The proposed algorithm is mainly compared with six algorithms: GN, label propagation (LPA), EM, spectral clustering, Louvain, and Walktrap. Among them, GN, LPA, spectral clustering, Louvain, and Walktrap are commonly used methods for community detection. EM is not commonly used for community detection, but it is widely used for clustering in machine learning. The choice of similarity is discussed in Section 5.1, followed by the conclusion.

Similarity Comparison.
In the second step of the algorithm, the choice of similarity has a large impact on the detected communities. Therefore, different similarity measures are used to test the results of the proposed algorithm. Common similarity measures are listed in Table 12. In Figure 10, PA and RA have the worst performance, because they are usually used to quantify the functional importance of edges and are more suitable for dynamic networks. Compared with the other similarities, Sørensen similarity gives better results. Based on these results, our algorithm uses Sørensen similarity to detect similar nodes (see Figure 11 and Table 13).
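For readers who want to compare the measures named above, the sketch below implements three of them from neighbor sets. It assumes that PA and RA denote the preferential attachment and resource allocation indices in their usual link-prediction form; the contents of Table 12 are not reproduced in this text, so the selection and the function names are assumptions.

```python
def preferential_attachment(adj, x, y):
    """PA index: product of the two node degrees."""
    return len(adj[x]) * len(adj[y])


def resource_allocation(adj, x, y):
    """RA index: common neighbours weighted by the inverse of their degree."""
    return sum(1 / len(adj[z]) for z in adj[x] & adj[y])


def sorensen(adj, x, y):
    """Sorensen index: shared neighbours relative to the total degree."""
    total = len(adj[x]) + len(adj[y])
    return 2 * len(adj[x] & adj[y]) / total if total else 0.0


# Two 4-cliques joined by the bridge 3-4.
adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4},
       4: {3, 5, 6, 7}, 5: {4, 6, 7}, 6: {4, 5, 7}, 7: {4, 5, 6}}
same_clique = (preferential_attachment(adj, 0, 3),
               resource_allocation(adj, 0, 3),
               sorensen(adj, 0, 3))
across_bridge = (preferential_attachment(adj, 0, 4),
                 resource_allocation(adj, 0, 4),
                 sorensen(adj, 0, 4))
```

On this toy graph PA gives the same score to a same-clique pair and a cross-clique pair because it depends only on degrees, whereas the Sørensen index separates them.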

Conclusion.
Community detection is a significant research topic in social network analysis. The goal of community detection is to divide the network into several communities such that each node belongs to only one community. However, when μ increases, the NMI value of existing algorithms drops rapidly. In this paper, we divide the process of community detection into three stages and detect communities by expanding from the central nodes. We test our method on several synthetic and real datasets and compare the results with previous methods. The experimental results show that our method is more effective, and as μ increases, its NMI value is higher than that of the other methods.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.