Overlapping Community Detection Algorithm Based on High-Quality Subgraph Extension in Local Core Regions of Network

Community structure is an important feature of complex networks, and detecting overlapping communities is an active research topic in data mining and graph theory. Community detection algorithms based on seed expansion have several shortcomings: randomly selected seeds make the detection results unstable, similar seeds expand into similar communities, and deleting nodes during seed expansion increases the amount of computation. To address these shortcomings, this paper proposes an overlapping community detection algorithm based on high-quality subgraph extension in local core regions of the network (OLCRE). First, a novel seed community selection method is designed. By analyzing the sum of node degrees of the subgraph formed by a node and its neighbor nodes in the local core region of the network, together with the tightness of the internal and external connections of the subgraph, a seed community selection function is proposed. According to this function, high-quality subgraphs are selected from all the local core regions of the network as seed communities. Then, taking the definition of community as the guideline, a new community expansion strategy is proposed: whether a neighbor node can join the seed community is determined by jointly considering its influence on the internal and external connection tightness of the seed community. Finally, after all seed communities have been expanded, overlapping nodes are simplified and possible missing nodes are redetected to further improve the quality of community detection. The proposed algorithm is tested on artificial and real-world networks and compared with several overlapping community detection algorithms. The experimental results verify the effectiveness and feasibility of the proposed algorithm.


Introduction
Many complex systems in the real world can be represented as complex networks through their connection patterns. Components of the system can be regarded as nodes in the network, and the connections between components can be regarded as edges, for example, the social network [1] formed by relationships among people and the metabolic network [2] connected through chemical reactions. Further study of complex networks has shown that community structure is a basic statistical characteristic of such networks. A community in a complex network can be understood as a collection of nodes with similar characteristics, which is usually reflected in dense connections between nodes within the same community and sparse connections between nodes in different communities. The purpose of community detection studied in this paper is to reveal the real community structure in complex networks, which has important theoretical significance and practical value for the topological and functional analysis of complex networks [3]. At present, the results of community detection research have been widely applied in public opinion analysis and control [4], search engines [5], personalized interest recommendation [6], and other fields. In addition, in view of the practical needs of epidemic transmission prevention and control [7], the community structure, which lies between the macroscopic and microscopic characteristics of a network, is taken as the entry point, and community detection in social networks is combined with epidemic transmission so as to provide important information about the transmission risk class of the persons involved in an epidemic.
Up to now, many classical community detection algorithms for complex networks have been proposed. They can be divided into two categories according to their detection results: nonoverlapping community detection algorithms and overlapping community detection algorithms. A nonoverlapping community detection algorithm divides the complex network into multiple disjoint communities. However, real-world networks contain overlapping communities; that is, a node can belong to multiple communities at the same time. In a social network, for example, a person may belong to multiple social circles (family circle, friend circle, and colleague circle). Therefore, the detection of overlapping communities in complex networks has more practical value. According to their research perspectives, overlapping community detection algorithms are mainly divided into algorithms based on label propagation, algorithms based on cliques, algorithms based on local extension, algorithms based on edges, algorithms based on nonnegative matrix factorization, and algorithms based on spectral clustering.
Algorithms based on label propagation, for example, the SLPA algorithm [8], first initialize labels for the nodes in the network and then carry out label propagation. The storage space of each node saves all labels received in the process of label propagation. In order to prevent too many overlapping nodes, a label control threshold is set to determine which labels are kept in the storage space of each node. After label propagation stops, nodes with the same label are divided into the same community, and nodes with multiple labels are considered overlapping nodes. The OMKLP algorithm [9] proposes a new core node evaluation model that analyzes the node degree and the local coverage density of the subgraph formed by a node and its neighbor nodes, and it assigns the same label to the core node and its neighbor nodes to achieve fast convergence. In the process of label propagation, each node is updated asynchronously and receives the community label corresponding to the maximum belonging coefficient among its neighbor nodes. After label propagation stops, nodes with the same label are divided into the same community, while nodes with multiple community labels are overlapping nodes.
Algorithms based on cliques, for example, the CPM algorithm [10], start from complete subgraphs (cliques) and detect communities through clique percolation. The nodes belonging to multiple disconnected cliques are overlapping nodes. The LOC algorithm [11] first finds all the cliques in the network and selects the node with the local maximum density as the initial community. Then, for each neighbor node of the initial community whose fitness function value is positive, the clique in which that node participates is added to the community. If the node does not participate in any clique, only the node itself is added to the community. Because a node can belong to multiple cliques and be added to different communities, overlapping community structures can be detected.
Algorithms based on local extension, for example, the LFM algorithm [12], start from different seed nodes and expand the community by constantly optimizing the fitness function value of the community. The nodes that are extended into multiple communities are overlapping nodes. The ECES algorithm [13] weights the network graph according to the similarity between nodes and then selects the node with the highest centrality value as the core node and expands it. This process is repeated in the remaining set of nodes until there are no nodes left.
Algorithms based on edges, for example, the LC algorithm [14], use the Jaccard function to calculate edge similarity, construct a hierarchical tree of edge communities with a clustering method, and then cut the hierarchical tree using the partition density function to obtain edge communities. Since a node can be incident to multiple edges, overlapping nodes appear naturally once the community of each edge is determined. Finally, the edge communities are transformed into node communities to obtain the overlapping community structure. The LCDEL algorithm [15] first transforms the node graph into a line graph, constructs the adjacency matrix of the line graph, calculates the node distance matrix of the line graph using the NDML metric, and obtains the feature matrix of the node distance matrix by principal component analysis. Finally, the feature matrix is clustered by the k-means algorithm combined with ensemble learning to obtain the overlapping community structure.
Algorithms based on nonnegative matrix factorization, for example, the DNMF algorithm [16], directly find the discrete community membership matrix, which can assign explicit community memberships to nodes without postprocessing. In addition, a pseudosupervision module is added to DNMF to utilize the identification information in an unsupervised way, which further enhances its robustness. The AGNMF-AN algorithm [17] uses an augmented attributed graph to combine both the topological structure and the attributed nodes of the network and introduces an effective framework to update the affinity matrix, in which the weight of the affinity matrix is modified adaptively in each iteration instead of using a fixed affinity matrix. In addition, the $\ell_{2,1}$-norm is used to reduce the impact of random noise and outliers on the community quality, which greatly improves the effectiveness of this algorithm.
Algorithms based on spectral clustering, for example, the SPOC algorithm [18], can extract prior information such as the likelihood of each node belonging to multiple communities from available metadata and node centrality measure, and a hierarchical algorithm is introduced to automatically detect communities. The ASC algorithm [19] constructs a new affinity matrix based on both the network structure and attribute information and does not need to define control parameters to combine structure and attribute. In addition, extra nodes and edges are not added to the original network which makes the algorithm suitable for application to large-scale networks.
In recent years, local community detection algorithms based on seed extension have been widely used in the field of community detection, because they can detect communities without the complete structural information of complex networks and offer high efficiency [20][21][22] and validity [23][24][25]. However, for overlapping community detection, they still have shortcomings in the quality and stability of the results: randomly selected seeds make the detection results unstable, similar seeds expand into similar communities, and deleting nodes during seed expansion increases the amount of computation. In view of these shortcomings, this paper proposes an overlapping community detection algorithm based on high-quality subgraph extension in local core regions of the network (OLCRE). The major contributions of this paper are as follows:
(1) A new method of seed community selection is proposed; that is, the subgraphs with tight internal connections and sparse external connections in the local core regions of the network are selected as seed communities, which conforms to the definition of community and ensures the high quality of the selected seed communities. Moreover, the seed communities selected by this method are determined by the network structure rather than chosen at random, which avoids fluctuation in the community detection results.
(2) A new seed community expansion strategy is proposed, which takes the definition of community as the guideline. Whether a neighbor node can join the seed community is decided by jointly considering its influence on the tightness of the internal and external connections of the seed community, so that the seed community expands in the direction of tight internal connections and sparse external connections and finally yields a high-quality community structure.
(3) The OLCRE algorithm proposed in this paper does not need any parameters to be set. It can be applied to networks of different scales and types and has universal applicability. The algorithm is tested on artificial networks and real-world networks and compared with several overlapping community detection algorithms, and the experimental results show that it is effective and feasible.

Basic Concepts and Definitions
A complex network can be modeled as an undirected and unweighted graph $G = (V, E)$, where $V = (v_1, v_2, \cdots, v_n)$ is a nonempty finite set of nodes and $E = (e_1, e_2, \cdots, e_m)$ is a nonempty finite set of edges. Table 1 lists the notations used in this paper and gives a brief explanation. The basic concepts and definitions used in this paper are described below.
Definition 1 (Seed community selection function). The seed community selection function, denoted by $SCS(i)$, is defined as follows: where $SG$ represents the subgraph formed by node $i$ and its neighbor nodes and $k_v$ represents the degree of node $v$. $|e_{v_1,v_2}| = 1$ if there is an edge connection between nodes $v_1$ and $v_2$. Otherwise, $|e_{v_1,v_2}| = 0$. The larger the value of $SCS(i)$ corresponding to the subgraph $SG$ formed by node $i$ and its neighbor nodes, the closer the subgraph lies to the local core region, and the more tightly the subgraph is connected internally and the more sparsely it is connected to the external region.
Definition 2 (Common neighbor edge). The common neighbor edge of edge $e_{i,j}$, denoted by $CNE(e_{i,j})$, is defined as follows: where $adj(i)$ is the set of neighbor nodes of node $i$ and $adj(j)$ is the set of neighbor nodes of node $j$.
Definition 3 (Cluster triangle). The cluster triangle in which edge $e_{i,j}$ participates, denoted by $CT(e_{i,j})$, is defined as follows: where $CT(e_{i,j})$ represents the set of cluster triangles in which edge $e_{i,j}$ participates. The more cluster triangles an edge participates in, the tighter the edge is connected to its neighbor edges. The more cluster triangles exist in the community, the tighter the connection within the community.
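The displayed formulas for Definitions 2 and 3 are not reproduced in this text, so the following Python sketch rests on an assumption: a cluster triangle of edge $e_{i,j}$ is taken to be a triangle formed by $i$, $j$, and one of their common neighbors in $adj(i) \cap adj(j)$. The function names and the adjacency-dictionary representation are illustrative and are not taken from the paper.

```python
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]  # node -> set of neighbor nodes


def common_neighbors(adj: Graph, i: Hashable, j: Hashable) -> Set[Hashable]:
    """Return adj(i) ∩ adj(j), the nodes adjacent to both endpoints."""
    return adj[i] & adj[j]


def cluster_triangle_count(adj: Graph, i: Hashable, j: Hashable) -> int:
    """Count triangles containing edge (i, j).

    Assumption: one triangle per common neighbor of i and j.
    """
    if j not in adj[i]:
        raise ValueError(f"({i}, {j}) is not an edge of the graph")
    return len(common_neighbors(adj, i, j))


# Example graph: triangle 1-2-3 with a pendant edge 3-4.
example = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
triangles_on_12 = cluster_triangle_count(example, 1, 2)
triangles_on_34 = cluster_triangle_count(example, 3, 4)
```

Under this reading, an edge inside a dense region has many common neighbors and therefore a large count, which matches the intuition stated in Definition 3.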
Definition 4 (Node to the community interior influence function). The node to the community interior influence function, denoted by $I$, is defined as follows: where $E_c$ is the edge set of community $C$ and, likewise, $E_{c'}$ is the edge set of the community $C'$ formed when a neighbor node joins community $C$. $|CT(e_{i,j})|$ represents the number of cluster triangles in which edge $e_{i,j}$ participates, and $n_c$ represents the number of nodes in community $C$.
If the corresponding $I$ value is greater than 0 after a node joins community $C$, it indicates that the node joining community $C$ can improve its internal connection tightness.
Definition 5 (Community boundary nodes). The boundary nodes of community $C$, denoted by $bv$, are defined as follows: where $V_c$ is the node set of the community $C$.
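Because the formula of Definition 5 is not reproduced here, the sketch below assumes the usual meaning of a boundary node: a member of community $C$ that has at least one neighbor outside $V_c$. The function name `boundary_nodes` and the adjacency-dictionary input are hypothetical choices made for illustration.

```python
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]  # node -> set of neighbor nodes


def boundary_nodes(adj: Graph, community: Set[Hashable]) -> Set[Hashable]:
    """Members of the community with at least one neighbor outside it.

    Assumption: this is the intended reading of Definition 5.
    """
    return {v for v in community if adj[v] - community}


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the bridge 2-3.
two_triangles = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
left_boundary = boundary_nodes(two_triangles, {0, 1, 2})
```

In this example only node 2 touches the outside of the left triangle, so it is the single boundary node of that community.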
Definition 6 (Node to the community exterior influence function). The node to the community exterior influence function, denoted by $E$, is defined as follows: where $V_c$ is the node set of the community $C$ and, likewise, $V_{c'}$ is the node set of the community $C'$ formed when a neighbor node joins community $C$. $|C'_{bv}|$ represents the number of boundary nodes of community $C'$, and $|C_{bv}|$ represents the number of boundary nodes of community $C$. $|e_{bv,u}| = 1$ if there is an edge connection between boundary node $bv$ and node $u$. Otherwise, $|e_{bv,u}| = 0$.
If the corresponding $E$ value is less than 0 after a node joins community $C$, it indicates that the node joining community $C$ makes the connections between the community and the outside sparser.
Definition 7 (Community quality optimization function). The community quality optimization function, denoted by $M$, is defined as follows: The community quality optimization function is used to simplify overlapping nodes and redetect possible missing nodes so as to further improve the quality of community detection results.

The OLCRE Algorithm
General Description of the OLCRE Algorithm.
As shown in Algorithm 1, the OLCRE algorithm first traverses the whole network and, according to the seed community selection function SCS, selects the subgraphs with tight internal connections and sparse external connections from the local core regions of the network as seed communities. In the seed community expansion stage, the influence of a neighbor node on the internal and external connection tightness of the seed community is considered jointly to determine whether the neighbor node can join the seed community. When the I value and E value of a neighbor node of the seed community satisfy I > 0 and E < 0, the neighbor node joins the seed community. Otherwise, it cannot join the seed community. When no neighbor node of a seed community satisfies the expansion condition, that seed community stops expanding, and the algorithm moves on to the remaining seed communities until all of them have been expanded. After the expansion of all seed communities is completed, overlapping nodes are simplified and possible missing nodes are redetected according to the proposed community quality optimization function, so as to further improve the quality of community detection. Finally, the output is the overlapping community structure C. Through the above steps, the overlapping community detection of complex networks is completed.

Seed Community Selection.
Seed selection is a key step of overlapping community detection algorithms based on seed expansion and has an important impact on the results of community detection. In this paper, a novel seed community selection method is proposed. According to the seed community selection function SCS, subgraphs with tight internal connections and sparse external connections are selected from the local core regions of the network as seed communities. The specific process is given in Algorithm 2.
The seed community selection algorithm starts from any node i in the network and calculates the SCS values of node i and of its neighbor nodes. If the SCS value of node i is not the largest, the search continues in the direction of the maximum SCS value. After the node with the maximum SCS value is found in a region, the subgraph formed by this node and its neighbor nodes is regarded as a seed community. If the node with the maximum SCS value is not unique, one of these nodes is selected at random. Then, the search for seed communities continues in the unvisited areas of the network until all nodes in the network have been traversed. In the end, the subgraphs with tight internal connections and sparse external connections in all local core regions of the network have been found and are used as seed communities.
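The search just described is a hill climb on the SCS value. A minimal Python sketch of that search follows. Since the SCS formula is not reproduced in this text, the scoring function is passed in as a parameter `score`; the function name, the seeded random generator, and the removal of duplicate seeds are assumptions added for illustration and are not taken from the paper.

```python
import random
from typing import Callable, Dict, FrozenSet, Hashable, List, Optional, Set

Graph = Dict[Hashable, Set[Hashable]]  # node -> set of neighbor nodes


def select_seed_communities(
    adj: Graph,
    score: Callable[[Hashable], float],
    rng: Optional[random.Random] = None,
) -> List[FrozenSet[Hashable]]:
    """Hill-climb on `score` from every unvisited node.

    Each local maximum yields a seed made of that node and its neighbors.
    `score` stands in for SCS, whose formula is not reproduced here.
    """
    rng = rng or random.Random(0)  # fixed seed: an assumption for repeatability
    visited: Set[Hashable] = set()
    seeds: List[FrozenSet[Hashable]] = []
    for start in adj:
        if start in visited:
            continue
        visited.add(start)
        current, best = start, score(start)
        while True:
            neighbors = adj[current]
            if not neighbors:
                break  # isolated node: it forms a singleton seed
            visited.update(neighbors)
            top = max(score(v) for v in neighbors)
            if best >= top:
                break  # current node is a local maximum
            # Ties between best-scoring neighbors are broken at random.
            tied = sorted((v for v in neighbors if score(v) == top), key=repr)
            current, best = rng.choice(tied), top
        seed = frozenset({current} | adj[current])
        if seed not in seeds:  # assumption: identical seeds are kept once
            seeds.append(seed)
    return seeds
```

For example, calling `select_seed_communities(adj, lambda v: len(adj[v]))` uses node degree as a stand-in score. The climb always terminates because the score strictly increases at every step.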

Seed Community Expansion.
In the stage of seed community expansion, a novel seed community expansion strategy is designed according to the proposed node to the community interior influence function I and node to the community exterior influence function E. The specific process is given in Algorithm 3.
The new seed community expansion strategy is as follows: select any neighbor node i of the seed community and calculate the I value and E value of the node. If $I(i) > 0$ and $E(i) < 0$ are satisfied, the node is added to the seed community; otherwise, it is not added. When no neighbor node of the seed community satisfies the expansion condition, the seed community stops expanding, and the algorithm moves on to the remaining seed communities until all the seed communities have completed the expansion.
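The expansion rule can be sketched in Python as follows. The formulas of I and E are not reproduced in this text, so both are passed in as callables. The two `toy_` functions are hypothetical stand-ins used only to make the example runnable; they are not the paper's influence functions.

```python
from typing import Callable, Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]  # node -> set of neighbor nodes
Influence = Callable[[Graph, Set[Hashable], Hashable], float]


def expand_seed(adj: Graph, seed: Set[Hashable],
                interior: Influence, exterior: Influence) -> Set[Hashable]:
    """Grow a seed by adding neighbors with interior > 0 and exterior < 0."""
    community = set(seed)
    changed = True
    while changed:
        changed = False
        # Neighbor nodes of the current community that are not yet members.
        frontier = set().union(*(adj[v] for v in community)) - community
        for node in sorted(frontier, key=repr):
            if (interior(adj, community, node) > 0
                    and exterior(adj, community, node) < 0):
                community.add(node)
                changed = True
                break  # the frontier changed, so recompute it
    return community


def toy_interior(adj: Graph, community: Set[Hashable], node: Hashable) -> float:
    """Hypothetical stand-in for I: positive with two or more links inside."""
    return len(adj[node] & community) - 1


def toy_exterior(adj: Graph, community: Set[Hashable], node: Hashable) -> float:
    """Hypothetical stand-in for E: links outside minus links inside."""
    inside = len(adj[node] & community)
    return (len(adj[node]) - inside) - inside


# Triangles {0, 1, 2} and {3, 4, 5} joined by 2-3; node 6 is linked to 0 and 1.
graph = {0: {1, 2, 6}, 1: {0, 2, 6}, 2: {0, 1, 3},
         3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}, 6: {0, 1}}
grown = expand_seed(graph, {0, 1}, toy_interior, toy_exterior)
```

With these stand-ins the seed {0, 1} absorbs nodes 2 and 6 and stops at node 3, which has only one link into the community. Nodes are only ever added, which is consistent with the stated goal of avoiding node deletion during expansion.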

Dealing with Overlapping Nodes and Missing Nodes.
Algorithm 1: OLCRE algorithm.
Input: Graph G = (V, E)
Output: Overlapping community structure C
1: C = ∅;
2: According to the seed community selection algorithm (Algorithm 2), the seed community set, denoted by Seeds, is selected from network G;
3: If Seeds ≠ ∅, select any seed community, denoted by s, and go to Step 4. Otherwise, go to Step 5;
4: Remove s from Seeds, expand it into a community structure C_s according to the seed community expansion algorithm (Algorithm 3), add C_s to C, and return to Step 3;
5: Simplify and redetect overlapping nodes and possible missing nodes;
6: Output the overlapping community structure C;

Algorithm 2: Seed community selection algorithm.
...
    if node i has been visited then
        continue;
    else
        mark node i as visited;
    end if
    max ← the SCS value of node i;
    while true do
        value ← the maximum SCS value among the neighbor nodes of node i, all of which are marked as visited; if the node with the maximum SCS value is not unique, a node j with the maximum SCS value is randomly selected;
        if max >= value then
            Seeds ← the subgraph formed by node i and its neighbor nodes serves as a seed community;
            ...

Wireless Communications and Mobile Computing

After the expansion of all seed communities is completed, any node that has not been added to a community (a missing node) is added to the community for which its M value, computed with the community quality optimization function M, is largest. In addition, to prevent excessive overlap from degrading the quality of community detection, the detected overlapping nodes are simplified. For each overlapping node, the M value is calculated with respect to each community in which it is located. If the M value is positive, the overlapping node is kept in that community; if the M value is negative, it is removed from that community. When the M values of an overlapping node are negative for all the communities in which it is located, the node is added to the community with the largest M value.
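The rules for missing and overlapping nodes can be sketched in Python as follows. The formula of M is not reproduced in this text, so it is passed in as the callable `quality`; `toy_quality` is a hypothetical stand-in and not the paper's function. Two further assumptions are made: an M value of exactly 0 is treated like a negative value, and nodes are processed one after another in the iteration order of the graph.

```python
from typing import Callable, Dict, Hashable, List, Set

Graph = Dict[Hashable, Set[Hashable]]  # node -> set of neighbor nodes
Quality = Callable[[Graph, Set[Hashable], Hashable], float]


def refine_memberships(adj: Graph, communities: List[Set[Hashable]],
                       quality: Quality) -> List[Set[Hashable]]:
    """Assign missing nodes and prune overlapping nodes using `quality`."""
    comms = [set(c) for c in communities]
    if not comms:
        return comms
    for node in adj:
        member_of = [idx for idx, c in enumerate(comms) if node in c]
        if not member_of:
            # Missing node: join the community with the largest M value.
            best = max(range(len(comms)),
                       key=lambda idx: quality(adj, comms[idx], node))
            comms[best].add(node)
        elif len(member_of) > 1:
            # Overlapping node: keep memberships with a positive M value.
            scores = {idx: quality(adj, comms[idx], node) for idx in member_of}
            keep = [idx for idx, s in scores.items() if s > 0]
            if not keep:
                keep = [max(scores, key=scores.get)]  # fall back to largest M
            for idx in member_of:
                if idx not in keep:
                    comms[idx].discard(node)
    return comms


def toy_quality(adj: Graph, community: Set[Hashable], node: Hashable) -> float:
    """Hypothetical stand-in for M: share of neighbors inside, minus 0.5."""
    if not adj[node]:
        return 0.0
    return len(adj[node] & (community - {node})) / len(adj[node]) - 0.5


# Triangles {0, 1, 2} and {3, 4, 5} joined by 2-3; node 6 is linked to 4 and 5.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5},
         4: {3, 5, 6}, 5: {3, 4, 6}, 6: {4, 5}}
refined = refine_memberships(graph, [{0, 1, 2, 3}, {3, 4, 5}], toy_quality)
```

In this example node 3 starts in both communities and is kept only in the second one, while the missing node 6 is attached to the second community.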

Time Complexity Analysis.
Assume that the number of nodes in network G is n and the average degree of nodes is k. The number of seed communities, the number of overlapping nodes, and the number of missing nodes detected by the OLCRE algorithm are r, o, and l, respectively. Firstly, high-quality subgraphs are selected from the local core regions of the network as seed communities, whose time complexity is $O(k^2 n)$. After that, the time complexity for all seed communities to complete the expansion is $O(k^2 r + k^2 n)$. Finally, the time complexity of simplifying and redetecting overlapping nodes and missing nodes is $O(kor + klr)$. To sum up, the time complexity of the OLCRE algorithm is $O(2k^2 n + k^2 r + kor + klr)$. Since r, k, l, and o are far less than n, the time complexity of the OLCRE algorithm is about $O(jk^2 n)$, where j is a constant.

Experimental Data Sets
Artificial Networks.
Since the LFR benchmark network [26] is very similar to real-world complex networks in the statistical characteristics of node degree and community size distribution, this paper uses this benchmark network as the test data set for the proposed algorithm and the comparison algorithms. The parameters of the LFR benchmark network are shown in Table 2.
In order to objectively reflect the performance of each algorithm, four groups of different types of artificial networks (see Table 3) are generated with the LFR toolkit by changing the mixing parameter $\mu$, the number of overlapping nodes $O_n$, the number of memberships of the overlapping nodes $O_m$, and the number of nodes in the network $n$. They are, respectively, artificial network group N1 with a gradually fuzzier community structure, artificial network group N2 with a gradually increasing number of overlapping nodes, artificial network group N3 with a gradually increasing number of communities to which overlapping nodes belong, and artificial network group N4 with a gradually increasing number of nodes.

Real-World Networks.
In order to compare the performance of each algorithm in detecting network community structure, seven real-world network data sets of different sizes and types are used in this paper: the Zachary karate club network (Karate for short) [27], the bottlenose dolphin network (Dolphins for short) [27], the books about US politics network (Polbooks for short) [28], the US election blog network (Polblogs for short) [27], the author collaboration network (Netscience for short) [27], the trust network (PGP for short) [29], and the friendship network (HR for short) [30]. The details of the seven real-world networks are listed in Table 4.

Algorithm 3: Seed community expansion algorithm.
Input: Graph G = (V, E), seed community set Seeds
Output: Overlapping community set C
C = ∅;
for each s ∈ Seeds do
    C_s = s;
    while true do
        select any neighbor node i of the seed community C_s;
        if I(i) > 0 and E(i) < 0 then
            add node i to C_s;
        end if
        if no neighbor node of seed community C_s satisfies I > 0 and E < 0 then
            break;
        end if
    end while
    C = C ∪ C_s;
end for

Table 2: Parameters of the LFR benchmark network.
Parameter    Meaning
n            The number of nodes in the network
k            The average degree of nodes
k_max        The maximum degree of nodes
μ            The mixing parameter
C_min        The number of nodes in the smallest community
C_max        The number of nodes in the biggest community
O_n          The number of overlapping nodes
O_m          The number of memberships of the overlapping nodes

Evaluation Metrics.
Since the community structure of the artificial network is known, normalized mutual information (NMI for short) [12] is used as the evaluation metric of the community detection results on artificial networks. NMI measures the similarity between the community structure detected by the algorithm and the real community structure, and its value range is [0, 1]. The more accurate the community structure detected by the algorithm, the larger the corresponding NMI value. The NMI is defined as follows: where $C_N$ is the number of real communities in the artificial network and $C_D$ is the number of communities detected by the algorithm on the artificial network. The rows of matrix $M$ correspond to the real communities of the artificial network, and the columns of matrix $M$ correspond to the communities detected by the algorithm. $M_{xy}$ is the number of nodes shared by the real community $x$ and the detected community $y$. $M_{x\cdot}$ is the sum of the elements of $M$ in row $x$, and $M_{\cdot y}$ is the sum of the elements of $M$ in column $y$.
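The quantities listed above ($M_{xy}$, $M_{x\cdot}$, $M_{\cdot y}$) are those of the widely used confusion-matrix form of NMI. The defining equation is not reproduced in this text, so the following Python sketch assumes that form, $NMI = -2\sum_{x}\sum_{y} M_{xy}\log\frac{M_{xy}\,n}{M_{x\cdot}M_{\cdot y}} \Big/ \left(\sum_{x} M_{x\cdot}\log\frac{M_{x\cdot}}{n} + \sum_{y} M_{\cdot y}\log\frac{M_{\cdot y}}{n}\right)$, and takes $n$ to be the sum of all entries of $M$. Both choices are assumptions, and the paper's exact variant may differ.

```python
import math
from typing import List, Sequence


def nmi_from_confusion(matrix: Sequence[Sequence[float]]) -> float:
    """NMI from a confusion matrix (rows: real, columns: detected).

    Assumption: the standard confusion-matrix form, with n equal to the
    sum of all matrix entries.
    """
    row_sums: List[float] = [sum(row) for row in matrix]
    col_sums: List[float] = [sum(col) for col in zip(*matrix)]
    n = sum(row_sums)
    numerator = 0.0
    for x, row in enumerate(matrix):
        for y, m_xy in enumerate(row):
            if m_xy > 0:  # zero cells contribute nothing
                numerator += m_xy * math.log(
                    m_xy * n / (row_sums[x] * col_sums[y]))
    denominator = (sum(r * math.log(r / n) for r in row_sums if r > 0)
                   + sum(c * math.log(c / n) for c in col_sums if c > 0))
    if denominator == 0:
        return 1.0  # both sides consist of a single community
    return -2.0 * numerator / denominator


perfect = nmi_from_confusion([[3, 0], [0, 3]])      # identical structures
unrelated = nmi_from_confusion([[2, 2], [2, 2]])    # independent structures
```

A diagonal confusion matrix gives the maximum value, and a matrix whose rows and columns are statistically independent gives the minimum, which agrees with the stated range [0, 1].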
Since the community structure of the real-world network is unknown, the extended modularity (EQ for short) [31] is adopted as the evaluation metric of the community detection results on real-world networks. EQ measures the tightness of the connections within communities, and its value range is [0, 1]. A higher EQ value means that the quality of the communities detected by the algorithm is better. The EQ is defined as follows: where $m$ is the number of edges in the network, $c$ is the number of communities detected by the algorithm in the real-world network, $O_i$ is the number of communities to which node $i$ belongs, and $k_i$ is the degree of node $i$. $A_{ij}$ is an element of the adjacency matrix of the network. $A_{ij} = 1$ if there is an edge connection between nodes $i$ and $j$. Otherwise, $A_{ij} = 0$.
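The symbols above match the commonly used form of extended modularity, $EQ = \frac{1}{2m}\sum_{l=1}^{c}\sum_{i,j \in C_l}\frac{1}{O_i O_j}\left[A_{ij} - \frac{k_i k_j}{2m}\right]$. The equation itself is not reproduced in this text, so the Python sketch below assumes this form. The function name and the adjacency-dictionary input are illustrative.

```python
from collections import Counter
from typing import Dict, Hashable, List, Set

Graph = Dict[Hashable, Set[Hashable]]  # node -> set of neighbor nodes


def extended_modularity(adj: Graph, communities: List[Set[Hashable]]) -> float:
    """Extended modularity of an overlapping cover (assumed standard form)."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2  # number of edges
    if m == 0:
        return 0.0
    # O_i: number of communities that contain node i.
    membership = Counter(v for c in communities for v in c)
    total = 0.0
    for community in communities:
        for i in community:
            for j in community:
                a_ij = 1.0 if j in adj[i] else 0.0
                expected = len(adj[i]) * len(adj[j]) / (2 * m)
                total += (a_ij - expected) / (membership[i] * membership[j])
    return total / (2 * m)


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the bridge 2-3.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
         3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
eq_disjoint = extended_modularity(graph, [{0, 1, 2}, {3, 4, 5}])
```

When every node belongs to exactly one community, all $O_i$ equal 1 and the expression reduces to ordinary modularity, which is 5/14 for the two-triangle example.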

Experimental Settings.
The OLCRE algorithm is tested on artificial network data sets and real-world network data sets and compared with the overlapping community detection algorithms DNMF [16], CoEuS [32], MULTICOM [33], and APAL [34] to verify its effectiveness and feasibility. The experimental running environment is a computer equipped with an Intel Core i9-11900K 3.50 GHz processor, 32 GB memory, and the Windows 10 operating system. The algorithm proposed in this paper is implemented in MATLAB R2021a, and the source code has been publicly shared and is available at https://github.com/GitZhaoY/OLCRE.git. Table 5 lists the year, programming language, and time complexity of each comparison algorithm, where m represents the number of edges in the network, n represents the number of nodes in the network, s represents the number of seeds, h represents the number of nodes within the seed community, c represents the number of communities, and t represents the number of iterations. From the data listed in Table 5, it can be seen that both the CoEuS algorithm and the MULTICOM algorithm have linear time complexity, which is on the same order of magnitude as the time complexity of the OLCRE algorithm proposed in this paper. The time complexity of the DNMF algorithm is on the order of $O(n^2)$, which is significantly higher than that of the OLCRE algorithm. The time complexity of the APAL algorithm is $O(m^3/n^2)$, which indicates that it has good operating efficiency on sparse networks but is not suitable for dense networks.

Experimental Results on Artificial Networks.
Figures 1-3 and Table 6, respectively, show the comparison results of the evaluation metric NMI obtained by each algorithm running on the four groups of different types of artificial networks.
In the network group N1, as the value of $\mu$ increases, that is, as the network community structure gradually blurs, the community detection accuracy of each algorithm decreases, but the community detection accuracy of the OLCRE algorithm is better than that of each comparison algorithm under all $\mu$ values. In the network group N2 with a gradually increasing number of overlapping nodes and the network group N3 with a gradually increasing number of communities to which overlapping nodes belong, the community detection accuracy of the OLCRE algorithm is better than that of each comparison algorithm. From the experimental data listed in Table 6 ("\\" means that the algorithm failed to detect communities in this experimental running environment), it can be seen that the community detection accuracy of the OLCRE algorithm is relatively stable and better than that of each comparison algorithm in the network group N4 with a gradually increasing number of nodes.
The above experimental results show that the seed community selection method and the community expansion strategy of the OLCRE algorithm proposed in this paper are effective and can be applied to networks of different scales and types.

Experimental Results on Real-World Networks.
Table 7 lists the EQ values obtained by the OLCRE algorithm and the other four overlapping community detection algorithms running on the seven real-world network data sets ("\\" means that the algorithm failed to detect communities in this experimental running environment). As can be seen from the experimental results listed in Table 7, the EQ values obtained by the OLCRE algorithm on the Dolphins network, Polbooks network, Polblogs network, Netscience network, PGP network, and HR network are all higher than those obtained by each comparison algorithm. Only on the Karate network is the EQ value obtained by the OLCRE algorithm slightly lower than that obtained by the DNMF algorithm.
The reason why the OLCRE algorithm does not obtain the maximum EQ value on the Karate network is analyzed below. Figures 4 and 5 ... network, while the OLCRE algorithm detects node 3 as the overlapping node, which loses some connection tightness. Therefore, the EQ value obtained by the OLCRE algorithm is slightly lower than that obtained by the DNMF algorithm.

Conclusions
The OLCRE algorithm proposed in this paper first selects high-quality subgraphs from all local core regions of the network as seed communities according to the proposed seed community selection function. Then, the seed communities are expanded in turn according to the proposed expansion strategy. Finally, after all seed communities have been expanded, overlapping nodes are simplified and possible missing nodes are redetected to further improve the quality of community detection. In this paper, four groups of artificial networks of different types and scales are designed, and the OLCRE algorithm is compared on them with several overlapping community detection algorithms. The community detection accuracy of the OLCRE algorithm on these four groups of artificial networks is better than that of each comparison algorithm. In the experiments on seven real-world networks, the OLCRE algorithm fails to obtain the maximum EQ value only on the Karate network, and its results on the other six real-world networks are all higher than those of the comparison algorithms. In conclusion, the experimental results verify that the OLCRE algorithm is effective and feasible. In addition, the OLCRE algorithm does not need any parameters to be set and requires only the basic network information (nodes and edges) to complete the detection of overlapping communities. It can be applied to networks of different scales and types and has universal applicability.