Approximation of interactive betweenness centrality in large complex networks

The analysis of real-world systems through the lens of complex networks often requires a node importance function. While many such views on importance exist, a frequently used global node importance measure is betweenness centrality, quantifying the number of times a node occurs on all shortest paths in a network. The centrality of a node often depends significantly on the presence of other nodes in the network; once a node is missing, e.g. due to a failure, other nodes' centrality values can change dramatically. This observation is, for instance, important when dismantling a network: Instead of removing the nodes in decreasing order of their static betweenness, recomputing the betweenness after each removal creates tremendously stronger attacks, as has been shown in recent research. This process is referred to as interactive betweenness centrality. Nevertheless, very few studies compute the interactive betweenness centrality, given its high computational costs: a worst-case runtime complexity of O(N^4) in the number of nodes in the network. In this study, we address the research questions of whether approximations of interactive betweenness centrality can be obtained at reduced computational cost, and how much quality/accuracy needs to be traded in order to obtain a significant reduction. At the heart of our interactive betweenness approximation framework, we use a set of established betweenness approximation techniques, which come with a wide range of parameter settings. Given that we are interested in the top-ranked node(s) for interactive dismantling, we tune these methods accordingly. Moreover, we explore the idea of batch removal, where groups of Top-k ranked nodes are removed before recomputation of betweenness centrality values.
Our experiments on real-world and random networks show that specific variants of the approximate interactive betweenness framework allow for a speedup of two orders of magnitude, compared to the exact computation, while obtaining near-optimal results. This work contributes to the analysis of complex network phenomena, with a particular focus on obtaining scalable techniques.


I. INTRODUCTION
Complex network theory provides powerful tools to understand the structures and dynamics of many complex systems. Essentially, these systems are modelled as nodes representing entities and links representing dependencies between entities. Much research effort has been spent on understanding different types of critical infrastructure systems, e.g. energy [1,2], communication [3,4], air transportation [5][6][7][8], railway [9], as well as social networks [10]. The phenomena and processes analyzed on these networks vary by study, including resilience analysis [11,12], delay/information spreading [13][14][15], growth pattern analysis and many others. Nevertheless, at the heart of many analysis tasks is the problem of identifying node importance, i.e. a quantification of the relative value of a node in a network. Indeed, identifying the exceptionally important nodes which maintain the structure and function of the network is a task of particular significance.
These node importance values vary for two reasons. First, the importance can be measured regarding different perspectives of importance; preferring local vs global or topological vs flow-like views. Depending on the chosen view, many different node centrality measures have been proposed, including degree centrality, closeness centrality [16], eigenvector centrality [17], Katz centrality [18] and betweenness centrality [19]. Second, the importance of a node often depends significantly on the presence of other nodes in the network. For a pair of nodes with redundant function, e.g., regarding propagation, one node can become significantly more important in the absence of the other node. This effect is visualized in Figure 1. Initially, node 9 is not important in the network. However, once node 14 fails, the majority of flow in the network is routed via node 9, since all flows have to go through the remaining path on the right-hand side. Accordingly, a very small change in the network, here referring to the failure of a node, can change the node importance significantly.
Existing methods usually do not take this dependency of node importance values into account, mainly because of limited computational resources. For instance, computing the exact betweenness centrality values of each node in a network has a worst-case time complexity cubic in the number of nodes, since essentially all shortest paths between all node pairs have to be computed [20]. Computing the interactive betweenness centrality requires recomputing the betweenness centrality after each node removal, increasing the worst-case time complexity to quartic in the number of nodes, i.e. O(N^4). Such a high computational complexity inhibits computations on even medium-sized networks, given that increasing the size of a network by a factor of 10 increases the required computational resources by a factor of 10,000. While static betweenness centrality computations can be sped up significantly by parallelization [21], interactive betweenness centrality cannot be accelerated in this way, given the dependency between consecutive attack steps: the subnetwork at step i + 1 is only determined once the to-be-removed node at step i is fixed.
In this study, we explore possibilities for computing an approximation of the interactive betweenness centrality for larger networks. To achieve this goal, we devise an estimation framework that exploits betweenness approximation techniques for selecting outstandingly important nodes in a network. There are several widely used static betweenness approximation methods, which come with a whole range of parameters. Moreover, in order to avoid recomputing the approximate betweenness at each iteration, we select a number of outstanding nodes (not only one) on the fly. Experiments on random and real-world networks show that this strategy computes rankings very similar to those obtained by the exact interactive betweenness computation. Moreover, experiments on network dismantling show that the results obtained by the approximation of interactive betweenness are close to those of exact interactive betweenness, at much lower runtime requirements. Our work contributes to the analysis of complex network phenomena, with a particular focus on obtaining scalable techniques.

A. The overall framework
We devised an interactive betweenness approximation framework consisting of a set of static betweenness approximation algorithms and different selections of k-batch removal (removing k nodes with high betweenness before recomputation). The exact interactive computation recomputes the betweenness values after each removal of the Top-1 node. However, the time complexity of a static betweenness computation is O(N * E), where N is the number of nodes and E is the number of edges in the network, which is prohibitive for large networks and makes the interactive computation even more expensive. To reduce the computational costs, we exploit approximation methods, since such methods can trade off speed against the identification of high-betweenness nodes. Note that the identification of the highest-betweenness node is the core part of the interactive computation. Moreover, we also considered the selection of k (i.e., the choice of how many nodes to remove in each iteration): instead of removing a single node with the highest betweenness, batch removal reduces the number of iterations of the interactive computation. Based on the above ideas, our framework has two core parts: 1. Static betweenness estimation: Compute estimated betweenness values of all nodes in the current GCC (giant connected component) of the network.
2. Selection of batch removal: Obtain Top-k ranked nodes and remove them from the network, then go back to 1).
In part 1), the approximation algorithms estimate the betweenness values of each node in the current GCC. The accuracy of the approximation affects the quality of the interactive computation: if the approximation method cannot identify the Top-k nodes correctly, this leads to errors in subsequent iterations, which propagate and often grow with an increasing number of removed nodes. Therefore, we need to select approximation methods with a good trade-off between quality and runtime. In part 2), the selection of the parameter k is also worth considering. On the one hand, if we choose a small k, we obtain better quality. The most extreme choice is k = 1, that is, recomputing the betweenness after each single node removal, which is extremely time-consuming but yields the best results. On the other hand, if we set k very large, we reduce the runtime at the price of deteriorated quality. In the best case, the k value is chosen adaptively in each iteration, as there may be only one or many high-betweenness nodes in the current GCC. To sum up, our interactive betweenness approximation framework focuses on the selection of approximation methods together with the number and size of batch removals.
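The two-part loop described above can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the paper's implementation: the graph is an adjacency dict, and the betweenness estimator is passed in as a function so that any of the approximation methods discussed below could be plugged in.

```python
from collections import deque

def giant_component(adj):
    """Return the node set of the largest connected component."""
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        comp, q = {start}, deque([start])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in comp:
                    comp.add(v)
                    q.append(v)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return best

def interactive_dismantle(adj, estimator, k=1, target=0.1):
    """Repeat: estimate betweenness on the current GCC (part 1),
    remove the Top-k nodes (part 2), until the GCC shrinks below
    target * original network size. Returns the removal order."""
    adj = {u: set(vs) for u, vs in adj.items()}   # work on a copy
    n0, removed = len(adj), []
    gcc = giant_component(adj)
    while len(gcc) > target * n0:
        scores = estimator(adj, gcc)               # part 1: estimation
        top = sorted(gcc, key=scores.get, reverse=True)[:k]
        for u in top:                              # part 2: batch removal
            for v in adj.pop(u):
                adj[v].discard(u)
            removed.append(u)
        gcc = giant_component(adj)
    return removed
```

For instance, `interactive_dismantle(adj, lambda a, g: {u: len(a[u]) for u in g}, k=1)` runs the loop with node degree as a crude stand-in estimator; in the framework, the estimator would be one of the static betweenness approximations.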

B. Static Betweenness Approximation
The existing algorithms for betweenness approximation compute an estimation of the static betweenness of all nodes in the network. Since all the approximation methods are based on Brandes' algorithm, we revisit this algorithm first. For a node pair (s, t), [20] defines the pair-dependency on a node v, denoted by δ_st(v), and the dependency of node s on v, denoted by δ_s•(v), as

δ_st(v) = σ_st(v) / σ_st,   δ_s•(v) = Σ_{t ∈ V} δ_st(v),

where σ_st is the number of shortest paths between s and t and σ_st(v) is the number of those paths passing through v. In addition, Brandes proved that δ_s•(v) obeys the recursion

δ_s•(v) = Σ_{w : v ∈ P_s(w)} (σ_sv / σ_sw) · (1 + δ_s•(w)),

where P_s(w) represents all parents of w on the breadth-first search (BFS) from s. Based on these, the betweenness value B(v) can be computed as B(v) = Σ_{s ∈ G, s ≠ v} δ_s•(v). That is, given a network with N nodes and E edges, a single BFS from one source node s computes the dependencies of s on all nodes in O(E) time. To obtain the betweenness of all nodes, each node of the network has to serve as a source node, requiring N iterations of BFS. In total, the computation of the exact static betweenness of all nodes needs O(N * E) time, which is quite expensive for large networks. Besides, for dense networks with E ≈ N^2, as the worst case, the time complexity is O(N^3).
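As a concrete reference point, Brandes' algorithm for unweighted graphs can be sketched as follows (a minimal Python version; note that for undirected graphs every pair is counted from both endpoints, so the returned values are twice the per-pair sums and are commonly halved for normalization):

```python
from collections import deque

def brandes_betweenness(adj):
    """Exact betweenness via Brandes' algorithm on an unweighted graph
    given as an adjacency dict. One BFS per source s accumulates the
    dependencies delta_{s.}(v) in a backward pass."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        # forward BFS: shortest-path counts sigma and predecessor lists
        sigma = dict.fromkeys(adj, 0)
        dist = dict.fromkeys(adj, -1)
        preds = {v: [] for v in adj}
        sigma[s], dist[s] = 1, 0
        order, q = [], deque([s])
        while q:
            u = q.popleft()
            order.append(u)
            for w in adj[u]:
                if dist[w] < 0:
                    dist[w] = dist[u] + 1
                    q.append(w)
                if dist[w] == dist[u] + 1:   # u lies on a shortest path to w
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        # backward pass: delta[v] += sigma[v]/sigma[w] * (1 + delta[w])
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc
```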
To reduce the computational cost, approximation methods compute a subset of node dependencies or pair dependencies instead of the set of all dependencies required by the exact computation. Different strategies for selecting this subset constitute several approximation methods. In general, there are three classifications: 1. Pivot sampling: Such methods conduct a BFS from a subset of source nodes, called pivots, and compute the dependencies on each node from the selected pivots only.
2. Node pairs sampling: Instead of considering node dependencies, such methods sample pairs of nodes and compute the pair dependencies on each node from selected node pairs.
3. Bounded BFS: Such methods change the stop condition of BFS and only consider a subset of shortest paths.
Besides these three classifications, there are also some recent methods for betweenness approximation, including a sparse-modeling-based method [22], an MPI-based adaptive sampling method [23] and a GNN-based method [24]. More details on the different static approximation methods and parameter settings are given in Appendix A.
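As an illustration of classification 1, pivot sampling in the spirit of RAND1 can be sketched as follows: dependencies are accumulated only from |S| sampled pivots and scaled up by N/|S|. This is illustrative code under our own assumptions, not the implementation evaluated in this study; RAND2's linear rescaling of contributions is omitted.

```python
import random
from collections import deque

def _dependencies(adj, s):
    """Single-source dependency accumulation (one stage of Brandes)."""
    sigma = dict.fromkeys(adj, 0)
    dist = dict.fromkeys(adj, -1)
    preds = {v: [] for v in adj}
    sigma[s], dist[s] = 1, 0
    order, q = [], deque([s])
    while q:
        u = q.popleft()
        order.append(u)
        for w in adj[u]:
            if dist[w] < 0:
                dist[w] = dist[u] + 1
                q.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
                preds[w].append(u)
    delta = dict.fromkeys(adj, 0.0)
    for w in reversed(order):
        for v in preds[w]:
            delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
    return delta

def pivot_betweenness(adj, n_pivots, seed=None):
    """RAND1-style estimate: run BFS only from sampled pivots and
    scale the summed dependencies up by N / |S|."""
    rng = random.Random(seed)
    pivots = rng.sample(list(adj), min(n_pivots, len(adj)))
    bc = dict.fromkeys(adj, 0.0)
    for s in pivots:
        delta = _dependencies(adj, s)
        for v in adj:
            if v != s:
                bc[v] += delta[v]
    scale = len(adj) / len(pivots)
    return {v: bc[v] * scale for v in adj}
```

With |S| = N the estimate coincides with the exact (unnormalized) values; with |S| << N it trades accuracy for an O(|S| * E) runtime.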

C. Choosing the size of batch removal
In this subsection, we describe in more detail how to determine k based on the current GCC in each iteration. If k is small, few nodes are removed from the current GCC, which requires more iterations and higher computational costs. On the contrary, if k is large, the computational costs are reduced but the quality decreases, since many removed nodes have already lost their importance. Therefore, we need to make a trade-off between quality and speed. Note that it is more reasonable to choose an adaptive k value based on the number of particularly central nodes in each iteration. Firstly, we need to roughly estimate the range of k for different networks. We selected k = 1 and conducted experiments of the interactive exact betweenness computation on 48 real-world networks with diverse sizes. We visualize the distribution of p_50% (the number of nodes that need to be removed to obtain a 50% GCC reduction) in Figure 2. As the left subplot shows, some larger networks can be cut to 50% by removal of few nodes (e.g., removing no more than 10 nodes yields a 50% reduction on a network with size > 10,000). Besides, the distribution of p_50% in the right subplot indicates that removal of no more than 50 nodes can cause a 50% GCC reduction on many networks. For fixed-size batch removal, we set k ∈ {1, 2, 4, 8, 16}. In the example dismantling sequence, we can see that in I4 (interactive 4th attack) and IR (interactive remainder), there are 2 nodes with high betweenness (e.g., nodes 2 and 3 with betweenness value 0.5 in IR) and these two nodes are of the same importance; we can remove both of them in one iteration to break up the GCC. In I3 (interactive 3rd attack), there is only one node (i.e. node 5) with a high betweenness value (0.5), that is, there is only one particularly central node. For such an outstandingly important node, it is reasonable to set k = 1 and remove only this single node from the GCC.
Inspired by the example network, we additionally consider setting k to be the number of nodes with betweenness ≥ 0.5, made adaptive in the range {1, 2, 4, 8, 16}. We also consider setting k to be the number of nodes with betweenness ≥ (average + standard deviation of the betweenness values). Alternatively, we can remove a certain percentage of nodes in each iteration; for the remaining experiments, we selected 1%, 5%, 10% and 20%. To sum up, we determine the k value in each iteration based on the distribution of betweenness values of the nodes in the current GCC. Table I shows an overview of the k settings for batch removal.
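The adaptive k settings can be expressed as a small selection routine. The sketch below covers the two adaptive rules (betweenness ≥ 0.5, and mean plus one standard deviation), snapping the resulting count to the grid {1, 2, 4, 8, 16}; it assumes normalized betweenness values and is illustrative only.

```python
def choose_k(betweenness, mode="threshold"):
    """Pick the batch size for the next removal round.
    mode 'threshold': count nodes with (normalized) betweenness >= 0.5;
    mode 'mean_std' : count nodes above mean + one standard deviation.
    The count is snapped down onto the allowed grid {1, 2, 4, 8, 16}."""
    values = list(betweenness.values())
    if mode == "threshold":
        count = sum(1 for b in values if b >= 0.5)
    else:  # mean + std rule
        mean = sum(values) / len(values)
        std = (sum((b - mean) ** 2 for b in values) / len(values)) ** 0.5
        count = sum(1 for b in values if b >= mean + std)
    count = max(1, count)  # always remove at least one node
    # largest allowed batch size not exceeding the count, capped at 16
    return max(g for g in (1, 2, 4, 8, 16) if g <= count)
```

In the framework loop, `choose_k` would be evaluated on the estimated betweenness of the current GCC before each batch removal.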

D. Measures for comparison
Accuracy: Given an approximation algorithm and a certain k setting, the output of our framework is a ranking of nodes from higher to lower interactive betweenness. To analyze the approximated ranking, we considered four aspects: 1. Identification of important nodes: whether the top-ranked nodes of the exact computation are correctly identified. 2. Ranking sortedness: compared with the exact ranking, the sortedness of the approximated ranking can be described by the inversion number.
3. Weighted coefficient: To account for the importance of top-ranked nodes, we used Weightedtau, which puts additional weight on exchanges between top-ranked nodes.
4. Destructiveness to the network: During interactive computation, the size of the GCC keeps decreasing as we keep removing nodes. A good method identifies nodes with high betweenness, which have a great impact on network connectivity, resulting in a quick dismantling process and a fast GCC reduction. We consider the number of nodes that need to be removed to cut the GCC down to 10%.
In total, we devise six measures to evaluate the accuracy compared to the standard ranking (i.e. the ranking of nodes from the exact computation with k = 1), as follows: 1. Top-1%-Hits: The fraction of the Top-1% nodes that are correctly identified by the approximate method.

5. Weightedtau: A node with rank a is mapped to weight 1/(a+1), and an exchange between two nodes with ranks a and b has weight 1/(a+1) + 1/(b+1). That is, top-ranked nodes have higher weights, which increases the impact of exchanges between important nodes.
6. 10% GCC reduction (p_10%): it represents how many nodes the method requires to remove to dismantle the network until GCC < 10% * N. The value p_10% is normalized to [0, 1], where 1 denotes the method that needs the minimum number of nodes to reach the 10% GCC reduction.
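Measures 2 and 5 can be illustrated with a short routine that, given the exact and the approximated rankings, counts inverted pairs and accumulates the hyperbolic exchange weights described above. This is a naive O(n^2) sketch for clarity; in practice an O(n log n) merge-sort variant or scipy.stats.weightedtau would be used.

```python
def ranking_measures(exact_rank, approx_rank):
    """Compare an approximate ranking against the exact one.
    Returns (inversions, weighted_exchanges): the plain inversion count
    and a weighted sum in which an exchange between exact ranks a and b
    (0-indexed) costs 1/(a+1) + 1/(b+1), so top-rank swaps weigh more."""
    pos = {node: i for i, node in enumerate(approx_rank)}
    # approximate positions, listed in exact-ranking order
    seq = [pos[node] for node in exact_rank]
    inversions, weighted = 0, 0.0
    for a in range(len(seq)):
        for b in range(a + 1, len(seq)):
            if seq[a] > seq[b]:          # the pair is ordered differently
                inversions += 1
                weighted += 1 / (a + 1) + 1 / (b + 1)
    return inversions, weighted
```

A perfectly reproduced ranking yields (0, 0.0); a swap of the two top-ranked nodes costs far more than a swap at the tail.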

[Table II excerpt] BA (Barabási-Albert) networks BA n m: number of nodes n ∈ {300, 700, 1000}; number of edges to attach from a new node to existing nodes m ∈ {2, 4, 6}.

Runtime:
We conducted all experiments on the same computer with four i7-6500U cores (2.50 GHz) and 16 GB RAM. We ran each approximation method independently and recorded the exact runtime.
Trade-offs: Considering the six measures of accuracy, we normalized the runtime and plotted it against the normalized measures to see which method offers a good trade-off. In order to analyze the results on different networks, we computed the average normalized runtime and measure values. To sum up, we use six measures to evaluate accuracy, and we also analyze runtime and trade-offs. Besides, we use the naming scheme approximation algorithm parameter k (e.g. RAND2 64 2 represents the RAND2 algorithm with number of pivots = 64 and k = 2).

First, we generated 9 ER (Erdős-Rényi) graphs, 9 BA (Barabási-Albert) graphs and 27 WS (Watts-Strogatz small-world) graphs with different sizes and parameters. Table II provides an overview of our random networks and generator parameters. Figure 4 visualizes four selected random networks. On these random graphs, we performed a sensitivity analysis of Top-1-node identification for three selected methods in order to select reasonable parameters (see below). In addition, we selected 48 real-world networks of different sizes and structures, covering a variety of domains, as obtained from http://networkrepository.com/networks.php:

A. Networks in this study
• Social (4 networks): Networks showing the social friendships between people. Nodes are persons and edges represent their connections.
• Biological (5 networks): Networks showing the interactions between elements in biological systems.
• Brain (7 networks): Networks representing functional connectivity in brains. We chose different brain networks of mouse, macaque and fly.
• Web (5 networks): Networks representing the hyperlinks between pages of the World Wide Web.
• Email (2 networks): Networks showing mail contacts between two addresses.
• Cheminformatics (2 networks): Networks reflecting the chemical interactions of materials. Table III shows an overview of our 48 real-world data sets, including network properties.

B. Sensitivity analysis / parameter selection
In order to select reasonable parameters for the approximation methods, we evaluated the quality of Top-1-node identification for each selected method by computing the static betweenness with each method on our generated random networks. Figure 5 reports the fraction of networks on which each competitor correctly identifies the Top-1 node.
RAND2: Figure 5 shows the results of identifying the Top-1 node on random networks for different numbers of sampled pivots. It can be seen that the quality (measured as the fraction of networks with correctly identified Top-1 node) increases with the number of pivots, and sampling with 512 pivots performs best. RAND2 64 can be chosen as a trade-off: it correctly identifies the Top-1 node on over 70% of the WS networks while saving much runtime. RK: As Figure 5 indicates, the best quality is obtained with ε = 0.07. However, even this setting identifies the Top-1 node on only 60% of all random graphs. As RK with ε = 0.2 and 0.3 cannot identify the Top-1 node on the ER networks, we chose ε = 0.07 and 0.1.
KPATH: The results of KPATH are shown in Figure 5. Its quality is the worst compared to RAND2 and RK. KPATH 0.2 4 and KPATH 0.2 8 are the most reasonable among all selected settings.

C. Accuracy
Since the computation on real-world networks is expensive, we analyzed results on the generated random graphs first, in order to select competitors for further experiments on real-world networks. Figure 6 presents the average measure values of the 66 competitors. We can see that RAND2 512 1 offers the highest accuracy in general. Figure 7 presents the distribution of the 10% GCC reduction of the 66 competitors. On the 10% GCC reduction measure, which is closely related to the dismantling problem, the quality of RAND2 64 1 is also good. Moreover, the accuracy of RAND2 64 is close to that of RAND2 512 for different k. Real-world networks: We ran experiments on 48 real-world networks and computed the six accuracy measures. Figure 8 presents the distribution of the measure values. We can see that RAND2 64 with k = 1 is outstanding on all measures. Besides, for constant k values, the quality deteriorates as the k value increases. On the measure p_10% (10% GCC reduction), it is clear that the quality becomes worse from k = 1% to k = 20%. Besides, RAND2 64 and RK 0.10 0.1 with removal of the nodes with B ≥ 0.5 can also offer good accuracy. Compared to RAND2 64 and RK 0.10 0.1, the quality of KPATH 0.2 4 is not good. We also computed the average measure values.

D. Runtime
The runtime of computing interactive betweenness depends on the size of the network, choices of k and selected approximation algorithms. We analyzed the runtime regarding different k values with the same approximation method. Besides, we evaluated the runtime of different approximation methods with the same k setting.
Runtime regarding different approximation methods: Figure 10 plots the runtime (in seconds) of RAND2 64, RK 0.10 0.1 and KPATH 0.2 4 with the same k setting (i.e., k = the number of nodes with B ≥ 0.5) on different real-world networks, with y-axis = runtime in seconds and x-axis = N log N, where N is the number of nodes in the network. Figure 10 shows that the time complexity is around O(N log N) for these sparse real-world networks. Note that for dense networks the runtime will theoretically be nearly O(N^2). Moreover, the runtime of RAND2 64 is the highest, while KPATH 0.2 4 is the fastest among the three approximation methods but does not offer good quality.
Runtime regarding different k settings: Figure 11 shows the runtime of RK 0.10 0.1 with k from 1 to 16. We can see that the runtime increases as k decreases: if we choose a smaller k, fewer nodes are removed in each iteration, resulting in a larger number of iterations and higher computational costs. Besides, doubling the k value saves about 50% of the runtime when k ≥ 2. When k = 1, the runtime reaches its upper bound and is not double that of k = 2.
To sum up, on sparse real-world networks, the observed runtime is around O(N log N), and for k ≥ 2 it scales inversely with k.

E. Speedup
In this subsection, we present the speedup of interactive betweenness approximation compared to the standard BETWI (exact interactive betweenness computation). Based on our experimental results for BETWI and the approximation methods, conducted on the same computer, we computed the speedups on ER networks of eight different sizes (N = 300, 400, 500, 600, 700, 800, 900 and 1000) with the same generator parameter p = 0.02. Similar to our analysis of runtime, our evaluation of the speedups of interactive betweenness approximation sheds light on two aspects: the speedups of different betweenness approximation algorithms and the speedups with increasing k. Figure 12 (left) shows the speedups of RAND2 512, RAND2 64, RK 0.07 0.1, RK 0.10 0.1, KPATH 0.2 4 and KPATH 0.2 8 with the same k setting. We can see that the speedup increases as the network becomes larger. As a fast algorithm, KPATH offers great speedups compared to RK and RAND2. Figure 12 (right) presents the speedups for different k settings. Removing one node from the GCC in each iteration induces low speedups, while doubling the k value approximately doubles the speedup.

F. Trade-offs
From the results on quality and runtime, some competitors (e.g. RAND2 64 0.5) achieve high quality but need hours on the largest networks. Other methods (e.g. KPATH 0.2 4 with k = 16) are quite fast but offer poor quality. In this subsection, we focus on the trade-offs of the selected competitors.
Trade-offs on specific networks: Figure 13 presents the trade-offs between quality (i.e., the values of the six accuracy measures) and speed (exact runtime). We used 3 colors to distinguish the 3 approximation methods and 3 markers to label 3 typical k settings: a fast one (k = 16), the slowest one (k = 1) and k = 4 as a trade-off. We can see that RK 0.10 0.1 with k = 4 achieves a good trade-off on Top-1%-Hits, taking no more than 25% of the maximum runtime to reach high accuracy. When considering the Inversion measure, KPATH 0.2 4 with k = 1 performs well. In addition, RAND2 64 with k = 4 also provides favorable trade-offs on both Top-1%-Hits and Weightedtau.
Average trade-offs: As the runtimes on different networks deviate by orders of magnitude, to analyze the trade-offs across the 48 real-world networks we normalized the runtime on each network (slowest = 1, fastest = 0) and then computed the average normalized runtime over the 48 networks. Besides, we further normalized the measure values to [0, 1] on each network and computed the average normalized measure values. Figure 14 shows the results. We used the same labels as in Figure 13 and added legends for the competitors with average normalized runtime ≤ 0.5 and average normalized measure values ≥ 0.6. We can see that setting k in {2, 4, 1%, 0.5} can yield good trade-offs with specific approximation methods.

IV. CONCLUSIONS
Betweenness centrality is a widely used measure of node importance, which counts the number of shortest paths in a network on which a node occurs. However, if a node in the network is attacked or loses its functionality, the betweenness values of the other nodes change. That is, all betweenness values need to be recomputed in order to reflect the actual node importance. Recent research suggests that, for the network dismantling problem, interactively removing the node with the highest betweenness outperforms removing nodes based on a ranking obtained from a single betweenness computation. However, the interactive betweenness computation requires a static betweenness recomputation on the current GCC after each node removal and is thus significantly more expensive (by a factor of N in the worst case) than the static approach.
In this paper, we systematically investigate the approximation of interactive betweenness centrality. We proposed a framework for interactive betweenness estimation with k-batch removal. Our framework consists of a set of static betweenness approximation algorithms with various parameter settings for identifying top nodes with high betweenness, together with selections of how many nodes to remove in each iteration. In other words, we not only analyzed the performance of removing one top node but also evaluated the removal of batches of nodes. As the computation of interactive betweenness is more expensive than the computation of static betweenness, we focus on choosing approximation methods with parameter settings and k values (the number of nodes to be removed in each iteration) which offer high quality and a good trade-off between accuracy and speed. To ensure that our data sets cover different network structures, we generated 45 random networks, including ER, WS and BA networks, and selected 48 real-world networks of distinct sizes from different fields. We devised six measures to evaluate accuracy, considering the identification of important nodes, the similarity of rankings and the effect on GCC reduction.
To make preliminary selections of suitable parameter settings, we conducted a sensitivity analysis of the static betweenness approximation algorithms and evaluated the quality of Top-1-node identification on random networks. We selected six approximation methods: RAND2 64, RAND2 512, RK 0.07 0.1, RK 0.10 0.1, KPATH 0.2 4 and KPATH 0.2 8. As for the k settings, based on the results of the 50% GCC reduction, we found that many networks can be dismantled by removing a small fraction of nodes, and we chose 11 different k settings (k ∈ {1, 2, 4, 8, 16, 1%, 5%, 10%, 20%, AS, 0.5}). We ran tests with 66 competitors (six approximation algorithms with 11 k settings) on random networks to further select competitors. Based on the results on random networks, we chose RAND2 64, RK 0.10 0.1 and KPATH 0.2 4 with 11 k settings and conducted experiments on larger real-world networks. We found that RAND2 64 1, RAND2 64 0.5, RK 0.10 0.1 1 and RK 0.10 0.1 0.5 offer high accuracy. Besides, we analyzed the runtime regarding different approximation algorithms and k settings. Our analysis of different approximation methods with the same k reveals that RAND2 64 is the slowest and KPATH 0.2 4 the fastest competitor. Moreover, we also found that doubling the k value reduces the runtime by about 50% for k ≥ 2, and that the runtime reaches its upper bound for k = 1 (shown in Figure 11). Our analysis of trade-offs indicates that RAND2 64 and RK 0.10 0.1 with k = 2, 4, 1% and 0.5 offer good trade-offs between accuracy and speed.
In synthesis, we have proposed a novel framework for interactive betweenness approximation. We systematically evaluated the selections of approximation algorithms with various parameter settings and the choices of different batch removals from three aspects: accuracy, runtime and trade-offs between them. Our work contributes to the analysis of complex network phenomena, with a particular focus on obtaining scalable techniques. Future work could investigate the interactive approximate computation of other network centrality measures.

Appendix A: Static betweenness approximation methods

Pivots sampling: [25] introduced RAND1 for betweenness approximation. RAND1 samples a subset of source nodes uniformly at random and computes the estimated betweenness of all nodes by scaling the accumulated dependencies up by N/|S|, where |S| is the number of sampled source nodes. [26] proposed the GSIZE algorithm, which determines the number of sampled pivots by graph size. GSIZE utilizes an adaptive sampling technique introduced by [27]: given a node v, GSIZE keeps sampling pivots s until the accumulated dependency δ_s•(v) exceeds 5 · N. [28] proposed RAND2, based on random sampling, to approximate the static betweenness values of all nodes. RAND2 modifies RAND1 by scaling contributions with a linear function: it decreases the contribution of nodes close to the source nodes and can thereby solve the overestimation problem of RAND1.
Node pairs sampling: [29] proposed a fully dynamic algorithm (DA) for computing estimated betweenness. DA keeps track of the old shortest paths and substitutes them only when necessary. [30] proposed RK, which samples pairs of nodes instead of conducting BFS from sampled source nodes. RK is an (ε, δ)-approximation algorithm: given an allowed additive error ε, RK guarantees that the error of each node's estimate is less than ε with probability at least 1 − δ. RK determines the sample size via the VC-dimension (Vapnik-Chervonenkis dimension), introduced by [31], instead of the network size. [32] presented the (ε, δ)-approximation method ABRA. ABRA uses progressive sampling and sets the stop condition by utilizing Rademacher averages, proposed by [33], and the pseudodimension, introduced by [34] in the statistical learning field.
Bounded BFS: [35] found that the betweenness of a node v in its ego network is related to the exact betweenness of v in the network. The ego network of v is composed of v itself, all neighbors of v and the edges that connect those nodes. [35] used neighbors within distance 2 in their EGO approximation algorithm; in other words, EGO bounds the BFS to 2 hops from the source nodes. [36] presented the KPATH method, which computes betweenness centrality values based on k-centrality measures. [36] assumed that nodes distant from each other do not contribute to the betweenness values. Compared to [35], the BFS of KPATH is bounded by k hops from the source nodes, and nodes with distance > k from the source node are not considered. [37] introduced an adaptive algorithm, KADABRA, which can approximate the betweenness of all nodes or just compute the Top-k nodes. KADABRA uses a balanced bidirectional BFS to sample shortest paths: instead of conducting a full BFS from s to t, KADABRA performs a BFS from s and a BFS from t at the same time, until the two BFSs touch each other.
As mentioned above, we divided the approximation algorithms into three classifications: pivot sampling, node pair sampling and bounded BFS. We selected one method with a good trade-off between runtime and quality from each classification. For pivot sampling methods, we chose RAND2, as it offers outstanding accuracy with a good trade-off. From an experimental perspective, the results of [38] and [39] both show that RAND2 outperforms other methods on the tested networks. From a theoretical perspective, with linear scaling, RAND2 can handle the overestimation problem of RAND1. Thus, RAND2 can be selected as a representative of the methods based on pivot sampling. However, the performance of RAND2 is determined by the sample size. As RAND2 needs |S| (the number of sampled pivots) iterations of BFS, the time complexity of static RAND2 is O(|S| * E). On the one hand, if we sample few pivots, we cannot identify the Top-1 node (the node with the highest betweenness) well. On the other hand, if we sample too many pivots, we perform redundant calculations. As [28] suggests, we selected constant sample sizes |S| ∈ {8, 16, 32, 64, 128, 256, 512}. We selected RK, proposed by [30], among the node pair sampling methods. The results on Top-1%-Hits in the benchmark provided by [38] indicate that RK is a better choice for identifying vital nodes. As an (ε, δ)-approximation method, RK's parameter ε greatly affects speed and quality by determining the sample size [40]:

r = (c / ε^2) (⌊log2(VD(G) − 2)⌋ + 1 + ln(1/δ)),

where VD(G) is the estimated vertex diameter of the network (the computation of the exact vertex diameter is quite expensive) and c ≈ 0.5 is a universal constant. Since we focus on identifying the Top-1 node for interactive approximation, we can set ε higher than 0.01 (the default). We evaluated the performance of RK with ε ∈ {0.07, 0.1, 0.2, 0.3}. As for δ, we set it to 0.1 (the default). In addition, we chose KPATH, introduced by [36], as a typical representative of the methods based on bounded BFS.
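The effect of ε on the RK sample size can be made concrete numerically. The sketch below evaluates the VC-dimension-based bound of Riondato and Kornaropoulos, assuming the commonly cited universal constant c ≈ 0.5; the vertex-diameter value is a made-up illustration input.

```python
import math

def rk_sample_size(vd, eps, delta=0.1, c=0.5):
    """Sample size r = (c / eps^2) * (floor(log2(vd - 2)) + 1 + ln(1/delta)),
    where vd is the (estimated) vertex diameter. c ~ 0.5 is the universal
    constant of the eps-sample theorem (an assumption of this sketch)."""
    inner = math.floor(math.log2(vd - 2)) + 1 + math.log(1 / delta)
    return math.ceil(c / eps**2 * inner)

# larger eps => far fewer sampled node pairs, which is why eps is tuned
# upward when only the Top-1 node matters (vd = 10 is illustrative)
sizes = {eps: rk_sample_size(10, eps) for eps in (0.07, 0.1, 0.2, 0.3)}
```

The quadratic 1/ε^2 dependence explains why raising ε from the default 0.01 to 0.07 or 0.1 already shrinks the sample size by orders of magnitude.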
KPATH approximates static betweenness centrality values using k-centrality measures. KPATH assumes that nodes distant from a node v contribute zero dependency to v and stops the BFS after k hops. Therefore, only pair dependencies of two nodes with distance ≤ k can contribute to the betweenness values. KPATH determines its sample size by a parameter α: the number of samples is proportional to N^(1−2α), where N is the number of nodes in the network. To distinguish the k in KPATH from our k-batch removal, we name the k in KPATH k_KPATH. We set k_KPATH ∈ {4, 8}. Besides, considering the α values, we set α to 0.0, 0.2 and 0.4 to make a comprehensive comparison. For the three selected methods with their parameter settings, we choose the naming scheme method parameter (e.g., KPATH 0.2 4 is the KPATH method with α = 0.2 and k_KPATH = 4). Table V presents an overview of our selected methods.
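The bounded-BFS idea can be sketched by truncating Brandes' BFS at k hops, so that only pair dependencies over distances ≤ k_KPATH contribute. This is a simplified, deterministic variant for illustration only; the actual KPATH additionally samples sources (in proportion to N^(1−2α)) rather than iterating over all of them.

```python
from collections import deque

def bounded_bfs_betweenness(adj, k):
    """Bounded-BFS betweenness estimate in the spirit of KPATH
    (simplified, no source sampling): the BFS from each source stops
    after k hops, so nodes at distance > k are never reached."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        sigma = dict.fromkeys(adj, 0)
        dist = dict.fromkeys(adj, -1)
        preds = {v: [] for v in adj}
        sigma[s], dist[s] = 1, 0
        order, q = [], deque([s])
        while q:
            u = q.popleft()
            order.append(u)
            if dist[u] == k:         # stop condition: k hops reached
                continue
            for w in adj[u]:
                if dist[w] < 0:
                    dist[w] = dist[u] + 1
                    q.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        # standard Brandes backward accumulation on the truncated tree
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc
```

With k at least the diameter the result coincides with exact (unnormalized) Brandes betweenness; smaller k trades accuracy for fewer visited edges per BFS.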