Identifying Key Nodes in Complex Networks Based on Local Structural Entropy and Clustering Coefficient

Key nodes have a signicant impact, both structural and functional, on complex networks. Commonly used methods for measuring the importance of nodes in complex networks are those using degree centrality, clustering coecient, etc. Despite a wide range of application due to their simplicity, their limitations cannot be ignored. e methods based on degree centrality use only rst-order relations of nodes, and the methods based on the clustering coecient use the closeness of the neighbors of nodes while ignore the scale of numbers of neighbors. Local structural entropy, by replacing the node inuence on networks with local structural inuence, increases the identifying eect, but has a low accuracy in the case of high clustered networks. To identify key nodes in complex networks, a novel method, which considers both the inuence and the closeness of neighbors and is based on local structural entropy and clustering coecient, is proposed in this paper. e proposed method considers not only the information of the node itself, but also its neighbors. e simplicity and accuracy of measurement improve the signicance of characterizing the reliability and destructiveness of large-scale networks. Demonstrations on constructed networks and real networks show that the proposed method outperforms other related approaches.


Introduction
Complex networks, as an effective means of studying the complexity of systems, provide theoretical tools for researchers to explore their objects from a new perspective.
The study of complex networks has penetrated many disciplines, such as computer science, control theory, medicine, biology, and economics. Berahmand et al. [1] adopted complex network methods to detect protein-protein interactions by constructing protein networks. Yao et al. [2] employed complex networks to characterize EEG signals and extract EEG signal information for emotion recognition. The mining of important nodes has always been a focus of research in the field of complex networks. In general, the function of a network mainly depends on a few important nodes, which have a huge impact on the structure and function of the network even though their number is relatively small. For example, in a criminal relationship network, the leader of the criminal gang can be quickly located. In a power network, important circuit breakers and power generation units are protected to prevent large-scale power outages, which can be caused by the failure of these units. In search engines, the pages presented to users are sorted by their importance. In the prevention of virus spreading, treatments are more likely to be carried out on key patients. These cases imply that the mining of key nodes is of great significance [3][4][5].
Existing methods for the identification of important nodes can be roughly summarized into three categories.
(1) Methods based on social network analysis. The main idea of these methods is to measure the importance of nodes according to statistical indicators based on graph theory, such as degree centrality [6], betweenness centrality [7], and closeness centrality [8]. PageRank [9] and HITS [10] improved on this kind of method and were adopted in the design of search engines.
(2) Methods based on systems science. The main idea of these methods is that node importance is equivalent to network destructiveness, which can be measured by the degree of damage to the network after removing the key nodes. Typical methods include the neighborhood coreness proposed by Bea et al. [11] and the K-Shell proposed by Kitsak [12].
(3) Methods based on information theory. As a basic concept of information theory, entropy has been widely used in complex networks in recent years. Fei et al. [13] applied information entropy to the identification of important nodes in complex systems. Zhang et al. [14] adopted local structural entropy to measure the importance of nodes.
Single-indicator methods show advantages in some respects, but they have many limitations. Therefore, scholars have combined some of these methods into fusion models. Berahmand et al. [15] proposed an important node identification method, DCL, which comprehensively considers a variety of properties of the node: the degree of the node itself, the degrees of the neighbor nodes, the clustering coefficient, and the common edges between the neighbor nodes. Liu et al. [16] proposed an entropy-based node importance evaluation approach that, treating the entire network as a unit, takes into account not only the importance of the node itself but also the relative importance of the node to its neighbors. Mester et al. [17] adopted traditional centrality measures and community detection based on clustering to analyze node importance from multiple perspectives. Qiu et al. [18] proposed a node importance measurement method that combines the local and global positions of nodes. Berahmand et al. [19] proposed a method that integrates the degree of nodes, the negative effects of clustering coefficients, and the positive effects of second-order clustering coefficients to define the importance of nodes.
Although recent works have focused on multiple perspectives to improve the effect of key node identification, these approaches are not always applicable to some specific networks. Therefore, it is desirable to improve the accuracy of the identifying model and decrease the time complexity [20].
In this paper, we propose a novel method (EC) based on local structural entropy and the clustering coefficient to measure the importance of a node, which integrates the degree of the node and its first-order neighbors with the closeness between the node and its neighbors. We conducted experiments and evaluated the accuracy with accepted criteria [21]. The results of the experiments on three constructed networks and eight real networks of different sizes demonstrate that our approach outperforms the others, especially on real datasets.

Preliminary
The notation used in this paper is summarized in Table 1. Complex networks can be described as an undirected graph $G = (V, E)$, with a node set $V = \{v_1, v_2, v_3, \ldots, v_n\}$ and an edge set $E = \{e_1, e_2, e_3, \ldots, e_m\}$, in which $|V| = n$, $|E| = m$, and the edge between two nodes can also be denoted as $(v_i, v_j)$.

Local Structural Entropy.
Entropy [22][23][24] is widely used in thermodynamics to describe the process of heat conduction, and the essence of entropy is the internal chaos of the system. With the development of statistics and informatics, the meaning of entropy has been expanded. In information theory, Shannon entropy [25], the basis of information theory, is used to measure the unpredictability of information or the uncertainty of information systems. Zhang et al. proposed the concept of local structural entropy [14], whose main idea is to use the local structural property of nodes in the whole network rather than the property of the node itself. The definition of local structural entropy is

$$LE_i = -\sum_{j=1}^{n} p_{ij} \log p_{ij}, \tag{1}$$

where $LE_i$ represents the local structural entropy of node $i$, node $j$ belongs to the local network corresponding to node $i$ (node $i$ together with its neighbors), $n$ is the number of nodes in this local network, and $p_{ij}$ is the ratio of the degree of node $j$ to the total degree of all nodes in the local network corresponding to node $i$. Hence, $p_{ij}$ is defined as follows:

$$p_{ij} = \frac{\mathrm{degree}(j)}{\sum_{k=1}^{n} \mathrm{degree}(k)}. \tag{2}$$
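As a minimal sketch of this definition, the following Python function computes $LE_i = -\sum_j p_{ij} \log p_{ij}$ for one node. It rests on two assumptions that the text does not state explicitly: the local network of node $i$ is taken to be node $i$ plus its direct neighbors, and the logarithm is natural (this choice reproduces the value 1.3730 quoted later for a local network with degrees 2, 3, 3, and 3). The names `local_structural_entropy` and `toy_adj` are illustrative, and the toy graph is hypothetical rather than one of the paper's figures.

```python
import math


def local_structural_entropy(adj, i):
    """Local structural entropy of node i, using the natural logarithm.

    adj maps each node to the set of its neighbors (undirected graph).
    Assumption: the local network is node i together with its neighbors.
    """
    local_nodes = [i] + sorted(adj[i])
    degrees = [len(adj[j]) for j in local_nodes]
    total = sum(degrees)
    if total == 0:
        return 0.0  # isolated node: there is no degree mass to distribute
    # p_ij is the share of node j's degree in the local network's total degree.
    return -sum((d / total) * math.log(d / total) for d in degrees if d > 0)


# Hypothetical toy graph with edges a-b, a-c, b-c, b-d, c-d.
toy_adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c"},
}
le_c = local_structural_entropy(toy_adj, "c")
```

Because only the degrees inside the local network enter the sum, the cost per node is proportional to its number of neighbors.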

Clustering Coefficient.
In graph theory, the clustering coefficient describes the degree to which nodes tend to cluster together in a graph. More specifically, it reveals the degree of connection between the neighbors of a node. The clustering coefficient, which describes the proportion of pairs of neighbors of a node that are themselves connected [26], can be defined as

$$CC_i = \frac{2E_i}{k_i(k_i - 1)},$$

where $CC_i$ is the clustering coefficient of node $i$, $E_i$ is the number of triangles between node $i$ and its neighbors, and $k_i$, which is the number of neighbors of node $i$, is represented as

$$k_i = \sum_{j=1}^{n} \delta_{ij},$$

where $\delta_{ij}$ denotes the connection between node $i$ and node $j$ ($\delta_{ij} = 1$ if they are connected and 0 otherwise).
Generally, the degree of a node is used to measure its importance. A node has greater influence if it has more neighbors. That is, the importance of a node is directly related to its degree. However, the methods based on degree centrality take into account only the first-order relation between the node and its neighbors, ignoring the second-order relation between the node and its neighbors' neighbors, although the second-order relation is an important property in reflecting the node's local information.
Local structural entropy, by taking into consideration the impact of neighbors when measuring the importance of nodes, achieves better effects on key node identification. However, it has low accuracy in highly clustered networks.
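The clustering coefficient of a single node can be sketched in a few lines of Python. The sketch assumes the standard local definition $CC_i = 2E_i / (k_i(k_i - 1))$, where $E_i$ counts the edges among the neighbors of node $i$, and it assumes the convention $CC_i = 0$ when $k_i < 2$, which the text does not specify. The function name and the toy graph are hypothetical.

```python
from itertools import combinations


def clustering_coefficient(adj, i):
    """Local clustering coefficient of node i in an undirected graph.

    adj maps each node to the set of its neighbors.
    """
    neighbors = adj[i]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention: the ratio is undefined for k_i < 2
    # E_i: edges among the neighbors of i, i.e. triangles that pass through i.
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


# Hypothetical toy graph with edges a-b, a-c, b-c, b-d, c-d.
toy_adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c"},
}
cc_a = clustering_coefficient(toy_adj, "a")  # both neighbors are linked
cc_c = clustering_coefficient(toy_adj, "c")  # two of three neighbor pairs are linked
```

In this toy graph node c has more neighbors than node a but a lower clustering coefficient, which illustrates why the coefficient alone ignores the number of neighbors.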
As shown in Figure 1, it is obvious that deleting node e is more destructive to the whole network than deleting node c; that is, node e has a more important influence than node c. In Figure 2, the total degree of the local network corresponding to node c is 11, and the degrees of nodes a, b, c, and d are 2, 3, 3, and 3, respectively. According to the definition of $p_{ij}$, we can obtain the ratio of the degree of each of these nodes to the total degree of the local network: $p_{ca} = 2/11$, $p_{cb} = 3/11$, $p_{cc} = 3/11$, and $p_{cd} = 3/11$. We can also get the local structural entropy of these nodes according to the definition of $LE_i$: $LE_a = 1.0822$, $LE_b = 1.3730$, $LE_c = 1.3730$, and $LE_d = 1.3863$. Similarly, we can obtain the values in the local network corresponding to node e in Figure 3, which give $LE_e = 1.2945$. It is notable from the calculation that $LE_c > LE_e$. The result illustrates that identifying the influential node by the $LE_i$ value alone is not accurate. Although local structural entropy performs better than degree centrality in measuring some network characteristics [14], it lacks accuracy in highly clustered networks.
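The arithmetic behind the value quoted for node c can be written out explicitly. The expansion below assumes the natural logarithm, which is the base that reproduces the quoted figure; it uses only the degree shares 2/11, 3/11, 3/11, and 3/11 given in the text.

```latex
\begin{aligned}
LE_c &= -\left( \tfrac{2}{11}\ln\tfrac{2}{11} + 3 \cdot \tfrac{3}{11}\ln\tfrac{3}{11} \right) \\
     &= \tfrac{2}{11}\ln\tfrac{11}{2} + \tfrac{9}{11}\ln\tfrac{11}{3} \\
     &\approx 0.3100 + 1.0630 = 1.3730 .
\end{aligned}
```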

Method
According to the analysis in the above section, an accurate measurement of the importance of a node cannot depend on local structural entropy only, although it has better effects than degree centrality.

EC Model.
The proposed method, EC, based on both local structural entropy and the clustering coefficient, takes into account not only the node itself, but also the structure of its neighbors. The process of the EC approach is described in Figure 4, and it can be represented as follows:
where $EC_i$ is used to measure the importance of node $i$, $LE_i$ is the local structural entropy of node $i$, $gnorm_i$ is a normalization of $g_i$, and $g_i$ is a matching factor for integrating the clustering coefficient with the local structural entropy of node $i$. These items are defined as follows:
where $CC_i$, the clustering coefficient of node $i$, reflects the closeness of its neighbors, while $LE_i$ indicates the structural property of the node and its neighbors.
$gnorm_i$ is a min-max normalization of $g_i$. To construct $EC_i$ from parameters ($LE_i$ and $gnorm_i$) with different properties, we adopt the function $u(x)$ to standardize these two parameters. The definition of $u(x)$ is
We adopt the method proposed in the previous section to calculate the importance of all the nodes in the network illustrated in Figure 1. From (7), we get Table 2, in which $EC_c = 0.5822$ and $EC_e = 0.7578$. It is obvious that $EC_c < EC_e$; in other words, the importance of node e is greater than that of node c in our proposed method. This shows that the proposed method, which takes into consideration both the local structure of a node and the closeness of its neighbors, is more appropriate for identifying key nodes in complex networks.
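The min-max normalization step can be illustrated in isolation. The sketch below shows the standard form $(g_i - g_{\min}) / (g_{\max} - g_{\min})$ only; it does not implement the matching factor $g_i$, the function $u(x)$, or $EC_i$ itself. The input values are hypothetical numbers chosen for illustration, and returning zeros for a constant input is an assumed convention.

```python
def min_max_normalize(values):
    """Rescale a list of numbers linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumed convention: a constant input carries no ranking information.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical matching-factor values for four nodes (not taken from the paper).
g = [0.8, 1.2, 2.0, 1.6]
gnorm = min_max_normalize(g)
```

The smallest input maps to 0 and the largest to 1, so the normalized factor is comparable across networks whose raw values differ in scale.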

Evaluation Criteria.
Robustness can be used for measuring the functional change of networks when some nodes are removed. Generally speaking, the function of a network is affected by its maximum connected component. In the experimental simulation of network robustness, we removed the edges between each node and its neighbors, one node after another, in descending order of node importance. The removal of edges between nodes decreases the connectivity of a network. After deleting the edges corresponding to a node, the lower the connectivity of the graph is, the greater the influence of the node [28]. As evaluation criteria, two values, denoted as ξ and τ, were adopted: ξ is the ratio of the number of subgraphs to the number of nodes after deleting the edges related to the node, and τ is the ratio of the maximum subgraph size to the number of nodes. The node deletion order is consistent with the descending order of importance.

$$\xi = \frac{|S|}{|V|}, \qquad \tau = \frac{M_s}{|V|},$$

where $M_s = \max_{S_i \in S} |S_i|$, $|V|$ is the total number of nodes in network $G$, $S$ is the set of connected subgraphs of the network, $|S|$ is the number of subgraphs in $S$, and $M_s$ is the size of the largest subgraph $S_i$. Large ξ and small τ mean that the deletion of the node reduces the connectivity of the network; hence, the node has a great influence.
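A minimal Python sketch of the two criteria follows. It reads "subgraphs" as connected components, computes ξ as the number of components divided by $|V|$ and τ as the size of the largest component divided by $|V|$, and assumes that a node whose edges have been deleted stays in the node set as an isolated component. The helper names `connected_components`, `isolate`, and `xi_tau`, and the toy graph, are hypothetical.

```python
from collections import deque


def connected_components(adj):
    """Return the sizes of all connected components, found by breadth-first search."""
    seen, sizes = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            node = queue.popleft()
            size += 1
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        sizes.append(size)
    return sizes


def isolate(adj, node):
    """Return a copy of adj with every edge incident to node removed."""
    new_adj = {u: set(nbs) for u, nbs in adj.items()}
    for nb in new_adj[node]:
        new_adj[nb].discard(node)
    new_adj[node] = set()
    return new_adj


def xi_tau(adj):
    """xi = number of components / |V|, tau = largest component size / |V|."""
    sizes = connected_components(adj)
    n = len(adj)
    return len(sizes) / n, max(sizes) / n


# Hypothetical toy graph: hub h linked to a, b, c, plus the edge c-d.
toy_adj = {
    "h": {"a", "b", "c"},
    "a": {"h"},
    "b": {"h"},
    "c": {"h", "d"},
    "d": {"c"},
}
xi_before, tau_before = xi_tau(toy_adj)
xi_after, tau_after = xi_tau(isolate(toy_adj, "h"))
```

On this toy graph, isolating the hub raises ξ from 0.2 to 0.8 and lowers τ from 1.0 to 0.4, which is the pattern the criteria associate with an influential node.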

Experiments on ER, WS, And BA Networks.
We constructed ER, WS, and BA networks for experiments and compared our approach EC with KS, LE, DC, DCL, and Cen, by each of which the nodes are sorted in descending order of importance. To describe the effect of the different methods, we selected only a specific deletion range for illustration.
The ER network has 500 nodes with a link probability of 0.4. As shown in Figures 5 and 6, the evaluation of each index is almost the same, which is due to the stable structure of the ER random network. The WS network in Figures 7 and 8 has 5,800 nodes, each node has five neighbors, and the random reconnection probability is 0.5. The BA network in Figures 9 and 10 has 8,800 nodes, with 18 edges added in each construction iteration. As can be seen from Figures 7 and 8, the Cen method is slightly better than the EC method, but the difference is not obvious; the same holds for DC in Figures 9 and 10. Cen and EC reach more accurate results than DC, LE, KS, and DCL. On the BA network, there are only small gaps among the methods except KS when 70% of the nodes are removed. So, these methods have almost the same performance on constructed networks, which is due to the randomness of the construction process.
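For readers who want to reproduce a comparable test network, the ER setting above (500 nodes, link probability 0.4) can be generated with the standard library alone. This is a sketch that assumes the $G(n, p)$ model, in which each node pair is linked independently with probability $p$; the text does not say which generator or random seed was used, so `erdos_renyi` and `seed=0` are illustrative choices. The WS and BA constructions are not covered here.

```python
import random
from itertools import combinations


def erdos_renyi(n, p, seed=None):
    """Build a G(n, p) random graph as a dict mapping node -> set of neighbors."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        # Each unordered pair is linked independently with probability p.
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj


er = erdos_renyi(500, 0.4, seed=0)
average_degree = sum(len(nbs) for nbs in er.values()) / len(er)
```

With these parameters the expected degree of every node is $0.4 \times 499 \approx 200$, so the degrees are tightly concentrated; this is consistent with the observation that the indices behave almost identically on the ER network.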

Experiments on Real Networks.
In order to generalize the method to real networks, eight real datasets of different sizes were selected for experiments of deliberate attack simulation, in which the top-ranked fraction of nodes in the ordered sequence is deleted. The statistical characteristics of each network are listed in Table 3, where $|V|$ is the number of nodes, $|E|$ is the number of edges, $d_{max}$ is the maximum degree of nodes in the network, $d_{avg}$ is the average degree, $K_{avg}$ is the average local clustering coefficient, $K$ is the global clustering coefficient, and $|T|$ represents the number of triangles.
Figures 11 and 12 show the change of τ and ξ when the first 25%-35% of the nodes of the network email-enron-only are removed. As can be seen from Figure 11, EC is significantly superior to LE, DC, KS, and Cen. When the first 34% of the nodes are removed, the values corresponding to LE, DC, KS, DCL, Cen, and EC are 0.6014, 0.4545, 0.6294, 0.4196, 0.4825, and 0.2867, respectively. In Figure 11, the τ of EC is significantly lower than that of the others, and the ξ of EC is higher than that of the others except DCL. This means that after deleting a certain ratio of the key nodes identified by EC, the connected network is separated into a large number of subgraphs, each containing a small number of nodes. The ξ of EC does not exceed that of DCL, although the difference is slight, whereas EC performs remarkably better than DCL on τ. So, in general, we believe EC outperforms the others.
Figures 13 and 14 show the changes of τ and ξ on the power-662-bus network. It can be seen that after deleting the top 20% of nodes, the result of EC is better than that of the others. This also indicates that multi-indicator methods are more accurate than single-indicator methods. KS has the worst result, and the performance of the other methods lies between EC and KS.
Figures 15 and 16 show the result of the experiment on the power-bcspwr09 network. When we delete the top 20% of nodes, the curves of τ produced by these methods fluctuate slightly, but the trends of ξ are almost consistent. It is hard to distinguish which method is best on this network. On the inf-openflights network, as shown in Figures 17 and 18, EC clearly has the best result on both τ and ξ.
The results on the other networks, namely the power-US-Grid network (Figures 19 and 20), the power-bcspwr10 network (Figures 21 and 22), the high-energy theory network (Figures 23 and 24), and the inf-roadNet-CA network (Figures 25 and 26), show that although the performances of the other methods alternate frequently, EC always maintains high accuracy. Therefore, our method is applicable to many kinds of large-scale networks. The results of the deliberate attack simulation on eight real networks of different sizes show that, compared with the five methods LE, DC, KS, DCL, and Cen, the ξ curve of the EC method has the fastest growth and the τ curve has the fastest decline in most cases. This indicates that deleting the key nodes identified by the EC approach can seriously damage the network; that is, EC can accurately measure the importance of nodes.

Conclusion
The paper proposed a novel method, EC, which focuses on the degree of the node and its neighbors and the closeness of the neighbors. It can measure the importance of nodes more efficiently and can be used to analyze the reliability and invulnerability of large-scale networks.
Mathematical Problems in Engineering
The results of experiments on three constructed networks and eight real networks indicate that attacks on the top-ranked nodes sorted by EC are more likely to increase the extent of the damage to the entire network. The experiments show that the rise in the number of subgraphs and the decline in the number of nodes in the giant component obtained with the EC method are more obvious than those of the other methods in most cases. Therefore, EC is superior and effective compared to LE, DC, KS, DCL, and Cen. However, we discovered that the accuracy is reduced in networks with high clustering. In future research, we will address this deficiency to increase the accuracy in highly clustered networks, and take into account large-scale, high-dimensional, and dynamic network properties, which are the challenges in research on identifying key nodes of complex networks.

Data Availability
The data that support the findings of this study are openly available at https://networkrepository.com/networks.php.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this manuscript.