Identifying Super-Spreader Nodes in Complex Networks

Identifying the most influential individuals spreading information or infectious diseases can assist or hinder information dissemination, product exposure, and contagious disease detection. Hub nodes, high betweenness nodes, high closeness nodes, and high k-shell nodes have been identified as good initial spreaders, but few efforts have used node diversity within network structures to measure spreading ability. Here we describe a two-step framework that combines global diversity and local features to identify the most influential network nodes. Results from susceptible-infected-recovered epidemic simulations indicate that our proposed method performs well and stably in single initial spreader scenarios associated with various complex network datasets.


Mathematical Problems in Engineering
affected by specific individual nodes. Our assumption is that k-shell decomposition [1,2] can be used for global analysis, with high global diversity/high local centrality nodes capable of penetrating multiple global layers and influencing large numbers of neighbors in local layers of complex networks.
To measure node influence, we propose a two-step framework for acquiring global and local node information within complex networks. Global node information is initially obtained using algorithms (e.g., a community detection algorithm for complex networks [5,19,20] or a k-shell decomposition algorithm for core/periphery network layers), after which entropy is used to evaluate network node global diversity. Next, local node information is acquired using various types of local centrality. Last, global diversity and local features are combined to determine node influence. In our experiments, spreading ability equaled the total number of recovered nodes over time. We used a susceptible-infective-recovered (SIR) epidemic simulation with various social network datasets [21-25] to compare the spreading capabilities of our proposed measure and social network local/global centralities [2,26,27].

Background
To represent a complex network, let an undirected graph be $G = (V, E)$, where $V$ is the network node set and $E$ the edge set. $n = |V|$ indicates the number of network nodes and $m = |E|$ the number of edges. Network structure is represented as an adjacency matrix $A = \{a_{ij}\}$ with $a_{ij} \in \{0, 1\}$, where $a_{ij} = 1$ if a link exists between nodes $i$ and $j$, otherwise $a_{ij} = 0$.
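As a minimal sketch of the representation described above, the following builds a symmetric 0/1 adjacency matrix from an edge list. The function name `adjacency_matrix` and the toy node labels are illustrative assumptions, not part of the original study.

```python
def adjacency_matrix(nodes, edges):
    """Build a symmetric 0/1 adjacency matrix for an undirected graph.

    nodes: list of hashable node labels (hypothetical example input)
    edges: iterable of (u, v) pairs
    """
    index = {v: k for k, v in enumerate(nodes)}
    n = len(nodes)
    a = [[0] * n for _ in range(n)]
    for u, v in edges:
        # An undirected link sets both a[i][j] and a[j][i].
        a[index[u]][index[v]] = 1
        a[index[v]][index[u]] = 1
    return a


def edge_count(a):
    """Number of edges: half the sum of all entries of a symmetric matrix."""
    return sum(sum(row) for row in a) // 2
```

For a three-node path, `adjacency_matrix(["a", "b", "c"], [("a", "b"), ("b", "c")])` yields a matrix whose row for `"b"` contains two ones, matching its two links.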
Degree (or local) centrality is a simple yet effective method for measuring node influence in a complex network. Let $C_d(i)$ denote the degree centrality of node $i$. Higher values indicate larger numbers of connections between a node and its neighbors. $NB_h(i)$ denotes the set of neighbors of node $i$ at an $h$-hop distance. Node degree centrality is therefore defined as
$$C_d(i) = |NB_h(i)| \tag{1}$$
where $|NB_h(i)|$ is the number of neighbors of node $i$ at an $h$-hop distance; in most cases, $h = 1$ [7].
Betweenness centrality or dependency measures the proportion of shortest paths going through a node in a complex network. Let $C_b(i)$ denote the betweenness centrality of node $i$. Higher values indicate that a complex network node is located along an important communication path. Accordingly, node betweenness centrality is defined as
$$C_b(i) = \sum_{s \neq i \neq t} \frac{\sigma_{st}(i)}{\sigma_{st}} \tag{2}$$
where $\sigma_{st}(i)$ is the number of shortest paths from node $s$ to node $t$ through node $i$ and $\sigma_{st}$ is the total number of shortest paths from node $s$ to node $t$ [3,7,16].
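The two centralities just described can be sketched in plain Python. This is an illustrative sketch under stated assumptions: the graph is an unweighted, undirected dictionary of neighbor sets, "h-hop distance" is read as exactly h hops, and betweenness is accumulated with Brandes' algorithm and counted once per unordered node pair. The function names are hypothetical.

```python
from collections import deque


def hop_neighbors(adj, v, h=1):
    """Nodes at exactly h hops from v (assumed reading of 'h-hop distance')."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        if dist[u] == h:
            continue  # no need to expand beyond h hops
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return {u for u, d in dist.items() if d == h}


def degree_centrality(adj, v, h=1):
    """Size of the h-hop neighbor set; with h = 1 this is the plain degree."""
    return len(hop_neighbors(adj, v, h))


def betweenness(adj):
    """Unnormalized betweenness via Brandes' algorithm (unordered pairs)."""
    cb = {v: 0.0 for v in adj}
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:  # back-propagate dependencies, farthest nodes first
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                cb[w] += delta[w]
    # Each unordered pair (s, t) was visited from both ends, so halve.
    return {v: c / 2 for v, c in cb.items()}
```

On a four-node path a-b-c-d, the two interior nodes each lie on two shortest paths between other nodes, while the end nodes lie on none.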
Closeness (or global) centrality measures the average length of the shortest paths from one node to other nodes. Let $C_c(i)$ denote the closeness centrality of node $i$. Higher values indicate node location in the center of a complex network, with a shorter average distance from that node to other nodes. Node closeness centrality is thus defined as
$$C_c(i) = \frac{1}{l_i}, \qquad l_i = \frac{1}{n - 1} \sum_{j \neq i} d_{ij} \tag{3}$$
where $l_i$ is the average length of the shortest paths from node $i$ to the other $n - 1$ nodes and $d_{ij}$ is the distance from node $i$ to node $j$ [16].
k-shell decomposition [1,2] iteratively assigns k-shell layer values to all nodes in a complex network. During the first step, let $k = 1$ and remove all nodes whose degree centrality is $C_d(i) = k = 1$. Following removal, some remaining network node degrees may be $k = 1$. Nodes are continuously pruned until there are no $k = 1$ nodes. All removed nodes are assigned a k-shell value of 1. The next step is similar: let $k = 2$, prune nodes, and assign a k-shell value of 2 to all removed nodes. Repeat the procedure until all network nodes are removed and assigned k-shell indexes. This method reveals the significant features of a complex network; for example, all Internet nodes can be classified as nuclei, peer-connected components, or isolated components [1].
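The closeness measure and the pruning procedure above can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the graph is assumed to be a dictionary of neighbor sets, closeness is taken as the inverse of the average shortest-path length over reachable nodes, and pruning removes nodes of degree at most k, starting from k = 0 so that isolated nodes receive shell 0 (an assumption beyond the text, which starts at k = 1). Function names are hypothetical.

```python
from collections import deque


def closeness(adj, v):
    """Inverse of the average BFS distance from v to the nodes it can reach."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


def k_shell(adj):
    """Assign a k-shell index to every node by iterative pruning."""
    degree = {v: len(adj[v]) for v in adj}
    remaining = set(adj)
    shell = {}
    k = 0
    while remaining:
        prune = [v for v in remaining if degree[v] <= k]
        if not prune:
            k += 1  # nothing left to prune at this level, move to next shell
            continue
        for v in prune:
            shell[v] = k
            remaining.discard(v)
            for w in adj[v]:
                if w in remaining:
                    degree[w] -= 1  # removal may expose new low-degree nodes
    return shell
```

For a triangle with a two-node tail attached, the tail nodes are pruned at k = 1 and the triangle at k = 2.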
The SIR epidemic model [2,26,27] is used in many fields to study the spreading processes of information, rumors, biological diseases, and other phenomena. The model consists of three states: susceptible ($S$), infective ($I$), and recovered ($R$). $S$ nodes are susceptible to information or diseases, $I$ nodes are capable of infecting neighbors, and $R$ nodes are immune and cannot be reinfected. Initially, almost all network nodes are in the $S$ set, with a small number of infected nodes acting as spreaders. During each time step, $I$ nodes infect their neighbors at a preestablished infection rate $\beta$, after which they become recovered nodes at a recovery rate of $\gamma$. The total number of nodes in an SIR model is $S(t) + I(t) + R(t) = n$, with $S(t)$ denoting the number of susceptible nodes at time $t$, $I(t)$ the number of infected nodes at time $t$, $R(t)$ the number of recovered nodes at time $t$, and $\rho(t) = R(t)/n$ the proportion of immune nodes.
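A discrete-time version of this process can be sketched as below. It is a minimal illustration under assumptions not stated in the text: updates are synchronous, each infected node tries to infect each susceptible neighbor independently with probability `beta`, and then recovers with probability `gamma`. The names `sir_run`, `beta`, `gamma`, and `rng` are hypothetical.

```python
import random


def sir_run(adj, seeds, beta, gamma=1.0, steps=50, rng=None):
    """Simulate SIR on a graph; return the recovered count after each step.

    adj:   dict mapping node -> set of neighbors
    seeds: initially infected nodes (the initial spreaders)
    """
    rng = rng or random.Random()
    infected = set(seeds)
    recovered = set()
    history = []
    for _ in range(steps):
        newly_infected = set()
        for u in infected:
            for w in adj[u]:
                susceptible = w not in infected and w not in recovered
                if susceptible and rng.random() < beta:
                    newly_infected.add(w)
        # Infected nodes recover only after their infection attempts.
        newly_recovered = {u for u in infected if rng.random() < gamma}
        infected = (infected - newly_recovered) | newly_infected
        recovered |= newly_recovered
        history.append(len(recovered))
    return history
```

Averaging the final entry of `history` over many runs gives one estimate of the spreading ability of the chosen seed node; the number of runs and steps are free parameters of this sketch.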

The Proposed Measure
Our two-step method for obtaining global and local node information in a complex network is illustrated in the following steps. In step 1, global algorithms (e.g., community detection, graph clustering, and k-shell decomposition) are used to analyze the global features of nodes, and results are used to compute their global diversity. In step 2, degree centrality is used to measure local node features. Global diversity and local features are then combined to determine the influence of complex network nodes.
In step 1, k-shell decomposition was used as an example for obtaining global node information in a complex network, with Shannon's entropy [28] applied to the k-shell values of a node's neighbors to determine how many network layers are affected by that node. According to (4), maximum entropy indicates a case in which a node is capable of connecting with all layers of a complex network, and minimum entropy (0) indicates a case in which all node connections are in the same network layer. The k-shell entropy of node $i$, which is larger when its neighbors' k-shell values are more diverse, is defined as
$$E_i = -\sum_{k=1}^{ks_{\max}} p_i(k) \log p_i(k) \tag{4}$$
where $\{1, 2, \ldots, ks_{\max}\}$ are the k-shell values of the neighbors of node $i$, $p_i(k)$ is the probability of the $k$-core layer among those neighbors (computed from the number of nodes in the $k$-core layer of the complex network), and $\hat{E}_i$ is the normalized k-core entropy required for the case under consideration.
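The entropy step can be illustrated with a short sketch. Several details here are assumptions rather than the authors' definitions: the probability of a layer is taken as the fraction of a node's neighbors that lie in that k-shell, the natural logarithm is used, and normalization divides by the logarithm of the largest k-shell index in the network. The function name `kshell_entropy` and the precomputed `shell` dictionary are hypothetical.

```python
import math
from collections import Counter


def kshell_entropy(adj, shell, v, normalize=True):
    """Shannon entropy of the k-shell indexes found among v's neighbors.

    adj:   dict mapping node -> set of neighbors
    shell: dict mapping node -> k-shell index (computed beforehand)
    """
    counts = Counter(shell[w] for w in adj[v])
    total = sum(counts.values())
    if total == 0:
        return 0.0  # an isolated node touches no layer
    entropy = 0.0
    for c in counts.values():
        p = c / total  # assumed: share of neighbors in this layer
        entropy -= p * math.log(p)
    if not normalize:
        return entropy
    ks_max = max(shell.values())
    # Assumed normalization: divide by the entropy of a uniform spread
    # over all ks_max layers; undefined (set to 0) when only one layer exists.
    return entropy / math.log(ks_max) if ks_max > 1 else 0.0
```

A node whose neighbors all share one k-shell value scores 0, while a node whose neighbors are spread evenly over every layer scores 1 under this normalization.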
In step 2, the node's degree centrality is used to analyze the value of local features in the complex network; the degree centralities of neighbors are also considered. High influence values indicate high degree centralities of a node and its neighbors, meaning that the node is capable of reaching the widest possible local range. The local feature of node $i$ is defined as
$$L_i = \sum_{j \in NB_{h=1}(i)} C_d(j)$$
where $C_d(j)$ is the degree centrality of neighbor $j$ and $NB_{h=1}(i)$ is the neighbor set of node $i$ at a 1-hop distance. $L_i$ can be extended to become a "neighbor's neighbor" version, meaning that all node neighbors with a 2-hop distance are considered.
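A minimal sketch of the local step follows, taking the local feature to be the sum of the degree centralities of a node's 1-hop neighbors. That reading, and the helper names `local_feature` and `top_local_node`, are assumptions made for illustration.

```python
def local_feature(adj, v):
    """Sum of the degrees of v's 1-hop neighbors (assumed definition)."""
    return sum(len(adj[w]) for w in adj[v])


def top_local_node(adj):
    """Node with the largest local feature; ties broken by node label."""
    return max(sorted(adj), key=lambda v: local_feature(adj, v))
```

Because the sum has one term per neighbor, the value grows both with the node's own degree and with how well connected its neighbors are.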
Finally, $\hat{E}_i$ and $L_i$ are combined to give $IF_i$, the final influence of node $i$.

Results and Discussion
Basic complex network properties and results from an analysis of each network's giant connected component (GCC) structure are shown in Table 1. We used three network dataset classifications: scientific collaboration, traditional social, and "other." Measures were degree, betweenness, and closeness centralities; k-shell decomposition; neighbor's core (also known as coreness) [29]; PageRank [30]; and our proposed method. Spreading experiment and SIR epidemic model parameters were 1,000 simulations for each dataset and 50 time steps per simulation, with the top-1 node for each measure serving as the initial spreader. Infection rates $\beta$ are shown in Table 1. According to at least one study, when the infection rate is large, the choice of spreading measure makes no difference [2]. To assign a suitable infection rate for each network dataset, rates were determined by comparing the theoretical epidemic threshold $\beta_{thd}$ with the number used in referenced studies [29]. Recovery rate was always $\gamma = 1$, meaning that every node in the infected set $I$ entered the recovered set $R$ immediately after infecting its neighbors. Experimental results and details are shown in Figure 1 and Table 2. We defined the leading group as those measures, among the set of measures used in the experiment, whose spreading result at time step $t = 50$ was larger than the maximum result at that time minus an inaccuracy rate (err) of 0.01, that is, 1%.
The number of recovered nodes $R(t)$ was used to measure and rank the spreading capability of various measures. The leading group can help determine measure stability for identifying the influence of nodes in different networks. Measures inside the leading group had approximately the same spreading capability. The average rank shown in Table 2 was used to interpret the expected rank in different networks: a measure with a lower average rank was viewed as having better discrimination in terms of identifying good spreaders.
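The leading-group and average-rank bookkeeping can be sketched as follows. Two points are assumptions: the 1% tolerance is applied relative to the maximum result (threshold = maximum × (1 − err)), and a measure's rank is one plus the number of measures with a strictly larger result. The function names and any numbers passed to them are hypothetical and are not values from Table 2.

```python
def leading_group(results, err=0.01):
    """Measures whose result exceeds the best result less a relative tolerance.

    results: dict mapping measure name -> recovered-node count at the final step
    """
    best = max(results.values())
    return {m for m, r in results.items() if r > best * (1 - err)}


def average_rank(results_by_network):
    """Mean rank of each measure across networks (rank 1 = most recovered)."""
    totals = {}
    for results in results_by_network.values():
        for m, r in results.items():
            # Rank = 1 + number of measures that did strictly better.
            rank = 1 + sum(1 for other in results.values() if other > r)
            totals.setdefault(m, []).append(rank)
    return {m: sum(ranks) / len(ranks) for m, ranks in totals.items()}
```

Counting how often a measure falls inside the leading group across networks gives the stability indicator discussed in the text, while the average rank separates measures that are all inside the group.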
According to the number of times each measure fell inside the leading group (a measure stability indicator), our proposed method performed well in terms of identifying the most influential network nodes and thus is capable of identifying nodes that serve as good spreaders with global diversity in a complex network. In addition to being within the leading group, the method also had a better ranking compared to other measures within that group. The identified influential spreaders were capable of reaching large numbers of network nodes through their diverse global connections, of affecting network layers, and of exerting a maximum spreading effect. Our results also indicate that the degree centrality of a node and its neighbors can be used to maintain the number of contact nodes in the local layer of a complex network. However, important differences were noted among measures. For example, the closeness measure performed well in the top-1 position of the ca-HepTh and Email-Enron networks (Figure 1, Table 2), but not in the ca-GrQc, jazz musician, or NetScience networks. Because the characteristic that the closeness measure is designed to capture may not have been sufficiently strong in those networks, it could not identify the most influential spreaders there.
Although our proposed method identifies the most influential nodes robustly and stably, we acknowledge two limitations. First, in cases of global node diversity and lower node degree centrality, the spreading capability of nodes is constrained and dependent on the degree centrality of their neighbors. The influence of a node is limited to the local layer of a complex network when the degree centrality of its neighbors is lower. The spreading range is also limited when a node's connected neighbors are located in the network's peripheral layer. However, the spreading range of nodes may be wide when the node's neighbors are located near the hub and within the core network layers and when information and ideas can still be spread to infect a large number of nodes throughout the network.
Second, maximum k-shell values are lower and network sizes are considerably smaller in the absence of global diversity in a complex network. As shown in Table 2, nodes with high global diversity in the Dolphins network could not be identified. In that case, the spreading ability of nodes identified by our proposed method decreased to the degree centrality (ignoring the first term), and the influence of nodes was limited to local network layers. In the absence of global diversity, (8) becomes $IF_i \approx L_i$ (the final influence approximately equals the local feature), which favors local network layers (i.e., degree centrality). The spreading ranges of nodes were also limited to local network layers when nodes were located in peripheral layers or inside local and dense clusters. However, broad spreading ranges were observed for nodes located in the network's core layers [2]. In addition, the normalized global diversity values $\hat{E}_i$ produced by our proposed method are similar to the participation coefficients reported by Teitelbaum et al. [31], and the high global diversity values of nodes that we observed are similar to those of connector hubs and kinless hubs, both of which have distinct participation coefficients.

Conclusion
Our plans are to add considerable detail to our analysis, to introduce a sophisticated method for evaluating spreading ability, and to clarify how the proposed method is affected by network structure. For example, global algorithms such as community detection algorithms can be used to analyze and obtain global information on community network structures and to determine how factors such as position and node role [31] affect the degree to which spreaders distribute information or diseases throughout a complex network. We also plan to study strategies associated with multiple initial spreaders in networks. Since overlapping infected areas for selected spreaders must be minimized [2], a multiple initial spreader scenario may either accelerate or hinder spreading within a complex network.