Competition-Based Benchmarking of Influence Ranking Methods in Social Networks

The development of new methods to identify influential spreaders in complex networks has been a significant challenge in network science over the last decade. Its practical significance spans from graph theory to interdisciplinary fields like biology, sociology, economics, and marketing. Despite the rich literature in this direction, we find little consistent effort to compare and rank existing centralities considering both the topology and the opinion diffusion model, as well as the context of simultaneous spreading. To this end, our study introduces a new benchmarking framework targeting the scenario of competitive opinion diffusion; our method differs from classic SIR epidemic diffusion by employing competition-based spreading supported by the realistic tolerance-based diffusion model. We review a wide range of state-of-the-art node ranking methods and apply our novel method to large synthetic and real-world datasets. Simulations show that our methodology offers much higher quantitative differentiation between ranking methods on the same dataset and notably high granularity for a ranking method over different datasets. We are able to pinpoint, with consistency, which influence ranking method performs better than another on a given complex network topology. We consider that our framework can offer a forward leap when analysing diffusion characterized by real-time competition between agents. These results can greatly benefit the tackling of social unrest, rumour spreading, political manipulation, and other vital and challenging applications in social network analysis.

Considerable effort has been devoted to assessing the importance of nodes in many types of complex networks over the last decade. Novel approaches, combined with classic graph centrality measures, have led to the emergence of three main categories of influence ranking methods. The first category argues that the location of a node is more important than its immediate ego network and thus proposes k-core decomposition [28, 29], along with improved variants such as [30-33]. The second category quantifies the influence of a node based solely on its local surroundings [34-36]. Finally, the third category evaluates node influence according to various states of equilibrium for dynamical processes, such as random walks [37, 38] or step-wise refinements [39].
Each ranking method, regardless of its nature and category, is validated through a state-of-the-art benchmarking methodology which, in almost all cases in network science, involves the SIR epidemic model [40-42]. This process may be suitable for validating metrics in an individual context, in order to produce a verdict on whether a ranking method is good enough, but often not more. For SNA, however, collective interplay is inherent [43], and the aforementioned real-world application contexts imply competition between multiple opinions, so a one-sided perspective will often not be reliable. A recent study shows that the traditional SIR model provides a poor description of the data when modelling disease dynamics, as it lacks infectious recovery dynamics, which are a better description of social network dynamics [44]. Consequently, we consider the SIR model inadequate for our benchmarking context, as it fails to model competition and opinion fluctuations. As such, we propose a more robust benchmarking principle that implies simultaneous competition between two or more information (opinion, rumour) sources, that is, in the same network and at the same time. To this end, we make use of the existing tolerance-based diffusion model [45], which represents, to the best of our knowledge, a novel benchmarking methodology in SNA.
To better underline the limitations incurred by using a SIR simulation versus our proposed competition-based benchmark, we illustrate a comparative example in Figure 1. In (a) and (b), we apply two distinct ranking methods (orange and blue), one at a time, and show that the diffusion process is unrestrained; orange manages to cover the network in time T1, faster than blue with T2, due to the higher dispersion of the three initial orange opinion sources. In the SIR context, the two simulations may lead to the conclusion that the orange ranking method is better than the blue one. In reality, we consider the scenario in (c) as the more probable one. Opinions will diffuse simultaneously and face constraints due to competition over each node (i.e., orange and blue exclude one another). In this case, we intuitively suggest that blue might win in terms of network coverage, as it has a tighter initial cluster forming around its three opinion sources. Consequently, the main observations are the following: (i) Neither of the two opinions will achieve coverages as high as in the one-sided scenarios, that is, coverages C3o and C3b will both be lower. (ii) Simulation time T3 may be longer than T1 ≈ T2, due to the need for attaining a state of balance in the emergent network.

Figure 1: One-sided versus competitive diffusion. It is suggested that the orange diffusion time T1 is shorter (better) than the blue time T2, due to the higher, more uniform dispersion of the orange opinion sources. However, in reality (c), neither of the two opinions may fully cover the network in the optimal times T1 or T2, nor will they achieve such high coverages.
(iii) The final ratio of opinions C3o/C3b is impossible to determine through one-sided simulations and is only determinable by the emergence of the two competing opinions (e.g., initial spreader position, connectivity of the spreaders, and community structure).
In light of these remarks, we propose a novel benchmarking framework which offers more reliable insights into comparing ranking methods aimed at real-world applications of social networks. The paper starts by presenting the benchmarking methodology in detail, followed by simulation results. We highlight the overlapping of several popular ranking methods in terms of selecting the same initial seeds, then proceed to compare the ranking methods using SIR as a reference, and then in pairs (one versus one) using our proposed methodology. Finally, we discuss the results, the difference in what our testing methodology can offer, and the implications of considering competing opinions. The Methods section details the validation datasets used and briefly reviews current state-of-the-art ranking measures for complex networks.

A Novel Competition-Based Influence Ranking Benchmark
State-of-the-art benchmarking methodologies for spreading processes on complex networks often rely on the SIR (SIS) model [40-42]. With this approach, an initial subset of nodes is infected according to a centrality measure, then the simulation measures how fast the surrounding susceptible nodes become recovered (i.e., including dead). Indeed, if we take the example of an epidemic, it spreads independently of other epidemics and has its own temporal evolution. On the other hand, if we consider opinion exchange between social agents, it is often exclusive (with regard to other, contradicting opinions) and is also dependent on the timing of the spread of other ideas. We argue that a SIR model cannot accurately capture fluctuations and direct competition between social agents. Also, as long as the infected nodes survive, they will eventually tamper with the whole network. Finally, the SIR model is sensitive to initial parameters, like the infection probability λ and recovery duration δ, needing step-wise refinements to obtain the desired results, which may vary easily in other experimental settings. Alternatively, we find several variants of the SIR model designed for competitive diffusion processes, such as the SI1I2S [5], SI1|2S [6], and SI1SI2S [7] models, but they target competitive epidemic diffusion.
As a novel, more robust, and more realistic alternative, we propose the usage of the tolerance-based model [45], which implies competition between two or more opinion sources in the same network, at the same time. To the best of our knowledge, this kind of benchmarking methodology is novel in the literature. Other graph-based predictive diffusion models [46] include the classic linear threshold (LT) [47], independent cascade (IC) [48], voter [49], Axelrod [50], and Sznajd [51] models. These models use either fixed thresholds or thresholds evolving according to simple probabilistic processes that are not driven by the internal state of the social agents [46]. The tolerance model, however, is the first opinion diffusion model to propose a truly dynamic threshold (i.e., a node's state evolves according to its dynamic interaction patterns). Therefore, based on its novelty and potential for realism, we use the tolerance model in our paper.
2.1. The Tolerance-Based Opinion Diffusion Model

The tolerance model [45] is based on the classic voter model [49], being a refinement of the stubborn agent model [11, 52], with the unique addition of a dynamic decision-making threshold, called tolerance θ_i, for each node.
We further introduce the specific network science notations needed to mathematically define our model. Given a social network G = (V, E), the neighbourhood of node v_i ∈ V is defined as N_i = {v_j | e_ij ∈ E}. Exemplifying for a context with two competing opinions, we introduce two disjoint sets of stubborn agents V_0, V_1 ⊂ V which act as opinion sources. Stubborn agents never change their opinion, while all other (regular) agents V \ (V_0 ∪ V_1) update their opinion based on the opinion of one or more of their direct neighbours. We denote by x_i(t) the opinion of agent v_i at time t; regular agents start with a random opinion value x_i(0) ∈ [0, 1]. We denote by s_i(t) the state of agent v_i at moment t. For a discrete opinion, x_i(t) = s_i(t); for a continuous opinion, the state s_i(t) is obtained by thresholding the opinion, as given in the following equation (for two opinions A and B):

s_i(t) = A, if x_i(t) < 0.5; none, if x_i(t) = 0.5; B, if x_i(t) > 0.5.
In the assumed social network, agents v_i and v_j are neighbouring nodes if there is an edge e_ij that connects them. Some agents may not have an opinion or may not participate in the diffusion process (i.e., s_i(t) = none), so interacting with these agents generates no opinion update. A regular node will periodically poll one random neighbour (simple diffusion) or all its neighbours (complex diffusion), average the surrounding opinion x_{N_i}(t) (i.e., over the vicinity N_i of an arbitrary node v_i, at time point t), and update its opinion x_i(t) using a weighted combination of its past opinion and that of its neighbour(s), as

x_i(t) = (1 − θ_i(t)) · x_i(t − 1) + θ_i(t) · x_{N_i}(t).

The tolerance parameter θ_i represents the amount of accepted external opinion and changes after each interaction, based on whether a node has faced a competing or a supporting opinion (in a binary context with opinions A and B). Once a node is in contact with the same opinion for a long enough time, it becomes intolerant (θ_i(t) = 0), so that the network converges towards a state of balance [53]. Opinion fluctuates and is transacted by all nodes, but stubborn agents are the only nodes which do not become influenced in turn, acting as perpetual sources of the same opinion [11].
The evolution towards both tolerance and intolerance varies in a nonlinear fashion, as an agent under constant influence becomes indoctrinated at an increased rate over time. If that agent faces an opposing opinion, it will eventually start to progressively build confidence in the other opinion. As such, the tolerance model employs a nonlinear fluctuation function, unlike most models in the literature [54, 55]. Based on realistic sociopsychological considerations in the dynamical opinion interaction model, we model tolerance evolution as

θ_i(t) = θ_i(t − 1) − α_0 ε_0, if s_i(t − 1) = s_j(t),
θ_i(t) = θ_i(t − 1) + α_1 ε_1, if s_i(t − 1) ≠ s_j(t).

Tolerance is decreased by α_0 ε_0 if the state of the agent before the interaction, s_i(t − 1), is the same as the state of the randomly chosen interacting neighbour, s_j(t). If the states are not identical (i.e., opposite opinions), then the tolerance is increased by the dynamic product α_1 ε_1. The two scaling factors, α_0 and α_1, both initialized with 1, act as weights (i.e., counters) which are increased to account for every event in which the initiating agent keeps its old opinion (tolerance decreasing) or changes its old opinion (tolerance increasing). Therefore, scaling factor α_0 is increased by +1 as long as an agent interacts with other agents having the same state (i.e., s_i(t − 1) = s_j(t)) and is reset to 1 otherwise.
Scaling factor α_1 is increased as long as the interacting state is always different from that of the agent and is reset if the states are identical. We introduced the scaling factors to model bias; they are used to increase the magnitude of the two tolerance modification ratios ε_0 (intolerance modifier weight) and ε_1 (tolerance modifier weight). The two ratios are chosen with the fixed values ε_0 = 0.002 and ε_1 = 0.01. We have determined these values as explained in [45].
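As a minimal illustration, the interaction and tolerance update rules described above can be sketched in Python. This is a hedged sketch, not the reference implementation: the clamping of θ to [0, 1], the 0.5 state threshold, and all function names are our own assumptions, and the special handling of agents with state "none" is omitted for brevity.

```python
# Illustrative sketch of one tolerance-model interaction (simple diffusion).
# EPS0/EPS1 are the fixed modifier weights stated in the text.
EPS0, EPS1 = 0.002, 0.01  # intolerance / tolerance modifier weights


def state(x):
    """Discretize a continuous opinion x in [0, 1] into state A, B, or none."""
    if x < 0.5:
        return "A"
    if x > 0.5:
        return "B"
    return None  # undecided; such agents generate no opinion update


def interact(x_i, theta_i, alpha0, alpha1, x_j):
    """One poll of a random neighbour j by regular agent i.

    Returns the updated (x_i, theta_i, alpha0, alpha1).
    """
    s_before = state(x_i)
    # Opinion update: weighted combination of past opinion and the neighbour's.
    x_i = (1.0 - theta_i) * x_i + theta_i * x_j
    if s_before == state(x_j):
        # Supporting opinion: become more intolerant; grow the alpha0 streak.
        theta_i = max(0.0, theta_i - alpha0 * EPS0)
        alpha0, alpha1 = alpha0 + 1, 1
    else:
        # Competing opinion: regain tolerance; grow the alpha1 streak.
        theta_i = min(1.0, theta_i + alpha1 * EPS1)
        alpha0, alpha1 = 1, alpha1 + 1
    return x_i, theta_i, alpha0, alpha1
```

A node with θ_i driven to 0 by a long streak of same-opinion interactions becomes effectively frozen, which is the intolerance mechanism that lets the network settle into a state of balance.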
In accordance with this mechanism, we designate two sets of stubborn agents, V_a and V_b, to act as initial spreaders simultaneously. In other words, we let all chosen centrality metrics compete against each other in a one-to-one diffusion scenario, where the sets V_a and V_b consist of the top p% spreaders selected by each pair of centralities. We ensure that V_a ∩ V_b = ∅ and |V_a| = |V_b|, with p = 0.05. We find this approach to offer a good qualitative comparison basis for estimating the effectiveness of node ranking methods.

Alternate Opinion Assigning Approach.
We further find that most state-of-the-art ranking methods have various degrees of overlap in terms of the top spreader nodes they assign. As such, we introduce an alternate opinion assigning (AOA) approach in order to distribute nodes into the two sets of spreaders V_a and V_b evenly and equitably for both ranking methods, say A and B. Figure 2 exemplifies the AOA approach, where ranking method A is depicted in orange and method B in blue.
Figure 2: A simulation of orange versus blue ranking methods translates into two independent simulations, slightly favouring each method in turn. The assignment of opinions is always evenly distributed in terms of the number of nodes, for example, 3 spreaders each in this example.

AOA means that each one-to-one influence ranking benchmark consists of two (or a multiple of two) independent simulations. Considering that ranking methods A and B produce two partially overlapping sets of top p% spreaders, we alternate the simulations as follows: (i) In the first simulation, method A (orange) has priority: one starts by assigning the first (top 1) spreader from V_a as an orange stubborn agent. This implies that the spreader remains in V_a and is removed from V_b, if present.
(ii) Then, the first spreader from V_b is assigned as a blue stubborn agent, removing it from V_a, if present.
(iii) We continue assigning nodes alternately to each opinion, filtering them out from the other list of spreaders.
(iv) The AOA stops when min(|V_a|, |V_b|) = p × N/2 and discards any extra nodes so that |V_a| = |V_b|, ensuring that both sets have an equal number of stubborn agents, namely, half of the desired p × N spreader population.
(v) In the second simulation, method B (blue) has priority: one starts by assigning the first (top 1) spreader from V_b as a blue stubborn agent. This implies that the spreader remains in V_b and is removed from V_a, if present.
(vi) The exact same AOA process is repeated, with B having priority over A.
The impact of AOA is highlighted in Figure 2, as we end up assigning two significantly different spreader sets for methods A and B. Methodologically speaking, one benchmark must consist of at least two simulations, but for better experimental results, one may run 2k simulations, ensuring that AOA is applied (i.e., k simulations favouring method A and k simulations favouring method B).
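The AOA steps above can be sketched as a short Python routine. This is a hedged sketch under our own assumptions (function name, list-based inputs, and the tie-breaking when a candidate list runs out are illustrative, not from the original specification):

```python
def alternate_opinion_assignment(ranked_a, ranked_b, n_spreaders, a_first=True):
    """Split two (possibly overlapping) ranked spreader lists into disjoint,
    equally sized stubborn-agent sets V_a and V_b.

    n_spreaders is the total budget p*N; each side receives half of it.
    a_first selects which ranking method has priority in this simulation.
    """
    half = n_spreaders // 2
    v_a, v_b = [], []
    it_a, it_b = iter(ranked_a), iter(ranked_b)
    taken = set()
    order = ((it_a, v_a), (it_b, v_b))
    if not a_first:
        order = order[::-1]
    while len(v_a) < half or len(v_b) < half:
        progressed = False
        for it, bucket in order:
            if len(bucket) >= half:
                continue
            for node in it:
                if node not in taken:  # i.e., remove it from the other list
                    taken.add(node)
                    bucket.append(node)
                    progressed = True
                    break
        if not progressed:  # candidate lists exhausted; stop early
            break
    return v_a[:half], v_b[:half]
```

Running the routine twice, once with `a_first=True` and once with `a_first=False`, reproduces the two independent simulations that make up one benchmark.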

Results
We set out to discover fundamental drivers in the underlying graph structure which shape and influence opinion spreading in complex networks. To this end, our experimental setup focuses on a comparative benchmark analysis involving the reviewed node centrality metrics defined in Section 5.2. For an objective comparison, we make use of two types of datasets: synthetic data (10,000-node random, mesh, small-world, and scale-free networks [56]) and real-world data (consisting of large, representative complex networks sized between 1,900 and 29,000 nodes).
In this section, two sets of results are detailed. First, we explore the correlations between ranking methods in assigning top spreaders. Naturally, within the top p% of nodes ranked by different centralities, we will eventually find common nodes. As such, we measure the amount of node overlap O_ab = |V_a ∩ V_b| and express the correlation of the two measures as corr_ab = O_ab / |V_a|, with corr_ab ∈ [0, 1]. For the second experimental phase of benchmarking influence ranking methods, we ensure that V_a ∩ V_b = ∅ by alternately assigning a node to each set, while removing it from the list of candidates of the other centrality, as explained by the AOA approach (Figure 2).
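The overlap correlation above reduces to a one-line set computation; a minimal sketch (the function name is ours):

```python
def spreader_correlation(top_a, top_b):
    """corr_ab = |V_a ∩ V_b| / |V_a|; lies in [0, 1] when |V_a| = |V_b|."""
    v_a, v_b = set(top_a), set(top_b)
    return len(v_a & v_b) / len(v_a)
```

For instance, two top-10 spreader lists sharing 5 nodes yield a correlation of 0.5.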

Correlations between Influence Ranking Methods.
Real-world datasets can be viewed as topological compositions of the basic graph properties found in synthetic Erdős-Rényi random (Rand), Forest-fire mesh (Mesh), Watts-Strogatz small-world (SW), and Barabási-Albert scale-free (SF) networks [56-58], so we rely solely on measurements on the synthetic datasets from Table 1. As such, the correlation process is applied on the four synthetic network types in order to better highlight their distinguishing topological features: uniform node degree distribution (random networks), high local clustering and community formation (mesh networks), high clustering and long-range links (small-world), and low average path length with hub formation (scale-free).
Figure 3 presents the correlations corr_ab between 10 × 10 selected pairs of centralities. Correlations are measured considering the following spreader set sizes: |V_a| = |V_b| = p × N, where p ∈ {0.01, 0.05, 0.1} and N is the size of the graph; we find that corr_ab drops slightly as p increases. The average changes δ in spreader correlations from p = 0.01 up to p = 0.1 are δ_Rand = −0.289, δ_Mesh = −0.193, δ_SW = −0.189, and δ_SF = −0.088. This overall drop in correlation can be explained as follows: more of the same nodes are determined as top spreaders by the ranking methods when the spreader sets are small; as p increases, each ranking method adds more nodes to its set of spreaders and the chance of overlap drops. However, when we look at each individual centrality measure in turn, we notice that some increase the correlation amount, while others decrease it. Section 1 and Figure 1 in the Supplementary Materials detail and discuss these measurements for the 10 selected ranking methods, over the four synthetic topologies, as p increases. As a representative overview, we present in Figure 3 only the results for p = 0.1. For each centrality combination, we provide the numerical correlation and a symmetric graphical correlation. For example, the correlation between degree and the Hirsch index in the random network is corr_Deg-HI = 0.576, which translates into a mid-blue gradient in the symmetric table cell HI-Deg. The last column in the table expresses the average correlation on each line. Summing up and averaging the values in the last column, we obtain the cumulated correlations for each topology: corr_Rand = 0.552, corr_Mesh = 0.497, corr_SW = 0.606, and corr_SF = 0.741.
Quantitatively and also intuitively, the highest spreader correlation is obtained on the scale-free network, as it naturally consists of a very small core of hub nodes. These hubs act as an invariant to p in the topology and are likely to be selected as top spreaders by all centrality measures. Even if p is changed, the correlation remains high (see Supplementary Materials, Section 1). On the opposite end of the spectrum lie the random and mesh topologies. Both are characterized by uniformity in node properties, so that various centralities will have a higher heterogeneity in their top spreader selection, leading to the smaller measured correlations. Lastly, the small-world network borrows the uniformity of meshes and the long-range links of a random network. Here, we measure a relatively high average correlation of 0.606, denoting that this network has a stable core of influential nodes, like the scale-free network.

Figure 3: Correlations between pairs of centralities on the random, mesh, small-world (SW), and scale-free (SF) networks. The blue colour intensity of a cell corresponds to the strength of the correlation found in the symmetric cell, that is, cell colour (i, j) ~ cell value (j, i). A stronger blue intensity denotes a stronger correlation.
Analysing each centrality in turn, we notice higher correlations between ranking methods of the same category, for example, the diffusion-based HITS, PageRank, and LeaderRank. Furthermore, some centralities are more suitable for some topologies and less efficient on others. For example, we confirm that degree is considerably more relevant for scale-free networks (correlation of 0.802 with other centralities), but only marginally relevant for the small-world network (correlation of 0.437). The same observation holds for closeness and betweenness. To better highlight the spatial overlap of spreader nodes, we provide a visual example in the Supplementary Materials, Section 2.
Arching over the presented results, we motivate the usage of alternate opinion assigning (AOA) by the high node overlap, ranging between 30% and 70%, that we find between all state-of-the-art centralities.
3.2. Independent SIR Simulations

For a comparative basis, we first estimate the efficiency of an influence ranking method by employing classic SIR simulation [41, 42]. In this sense, we measure both the time needed to infect the majority of nodes (expressed in simulation iterations τ) and the final coverage of the infection (expressed as a percentage ρ of the total network size). We use the following SIR-specific parameter values [40, 41]: p = 0.05 (i.e., the top 5% of nodes selected as spreaders), k = 0.95 (i.e., at least 95% of the population to be infected as a stop condition), λ = 0.05 (i.e., 5% probability of becoming infected during an interaction), and δ = 10 (i.e., a 10-iteration duration of the infectious state for a node).
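A discrete-time SIR simulation with these parameters can be sketched as follows. This is our own minimal, synchronous-update sketch (the adjacency-dict representation, function name, and tie-breaking choices are assumptions, not the authors' implementation):

```python
import random


def sir_benchmark(adj, seeds, lam=0.05, delta=10, k=0.95, max_iter=10_000, seed=0):
    """Minimal SIR sketch on an adjacency dict {node: [neighbours]}.

    Returns (tau, rho): iterations until the stop condition, and final
    coverage counting infected + recovered nodes. Parameter names follow
    the text: lam is the per-contact infection probability, delta the
    infectious duration, and k the target coverage stop condition.
    """
    rng = random.Random(seed)
    infected = {v: delta for v in seeds}  # node -> remaining infectious time
    recovered = set()
    n = len(adj)
    for t in range(1, max_iter + 1):
        # Each infected node tries to infect its susceptible neighbours.
        newly = set()
        for v in infected:
            for u in adj[v]:
                if u not in infected and u not in recovered and rng.random() < lam:
                    newly.add(u)
        # Age the infections; nodes recover after delta iterations.
        for v in list(infected):
            infected[v] -= 1
            if infected[v] == 0:
                recovered.add(v)
                del infected[v]
        for v in newly:
            infected[v] = delta
        coverage = (len(infected) + len(recovered)) / n
        if coverage >= k or not infected:
            return t, coverage
    return max_iter, (len(infected) + len(recovered)) / n
```

Averaging (tau, rho) over repeated runs with different seeds reproduces the kind of per-method measurements reported in Table 2.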
The simulation results in Table 2 represent values of τ and ρ averaged over 10 repeated simulations on each dataset, for each individual ranking method (i.e., amassing to a total of 10 · 8 · 10 = 800 simulations). Through these results, we want to highlight that when running a diffusion process for each ranking method in an individual manner (i.e., one by one), the feedback provided regarding ranking efficiency is often limited.
The results for most topologies are very close in terms of the measured τ and ρ, suggesting that differentiation between ranking methods is unreliable. For instance, analysing the coverages ρ in Table 2, the average coverage for Rand is ρ_Rand = 95.47% with a standard deviation of only σ_Rand = 0.082. The measured difference Δ between the most efficient ranking method (Hirsch index) and the least efficient one (degree) is only Δ_Rand = 0.3% on the Rand network. Similarly, the standard deviations σ for the real-world networks are σ_OSN = 0.214, σ_FB = 0.042, σ_Emails = 0.230, and σ_POK = 0.273. The differences Δ between the most and least efficient ranking methods are roughly Δ_OSN = 1.4%, Δ_FB = 0.4%, Δ_Emails = 2%, and Δ_POK = 5.5%. For a visual representation of the coverage ρ benchmark results, refer to the Supplementary Materials, Section 4.
We consider these simulation results to highlight an overall lack of perspective regarding which ranking method is better on a given topology. Likewise, the best ranking methods are not consistent across datasets. For instance, HITS turns out to be the most efficient ranking method on SW, but the least efficient on the SF network; Deg is the least efficient on Rand, 2nd on Mesh, 7th on SW, and 6th on SF, yet it comes 8th if we average all results; Btw is 5th on OSN, 4th on FB, 5th on Emails, and 3rd on POK, and comes 3rd overall. This kind of inconsistency further supports our call for an improved benchmarking methodology.

Competition-Based Simulations.
We let each of the n = 10 selected centrality measures compete in a one-to-one scenario over the 4 synthetic and 4 real-world datasets. Every dataset comprises a total of n(n − 1)/2 = 45 pairs of simulations, translating into 2 × 45 = 90 individual simulations due to AOA. For statistical rigour, each experiment is repeated 10 times, making a batch of 20 simulations per pair, leading to 45 × 20 = 900 simulations per dataset, amassing to an overall 8 × 900 = 7,200 unique experiments. The full set of numerical results is available in the Supplementary Materials, Section 3, Tables 1 and 2. Condensing the simulation results, we present in Table 3 the average performance of the 10 ranking methods on the 8 datasets. This performance is quantified as the average percentage of opinion coverage obtained in the one-to-one competition benchmarks (e.g., HITS obtains a coverage of 65.23% on the OSN dataset).
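The experiment count above follows directly from enumerating all unordered centrality pairs; a short sketch (the centrality abbreviations in the list are illustrative placeholders):

```python
from itertools import combinations

# 10 ranking methods under comparison (abbreviations are illustrative).
centralities = ["Deg", "Btw", "Cls", "Ecc", "EC", "HI", "KS", "HITS", "PR", "LR"]

pairs = list(combinations(centralities, 2))   # n(n-1)/2 one-to-one benchmarks
runs_per_dataset = len(pairs) * 2 * 10        # AOA doubles each pair; 10 repetitions
total_runs = runs_per_dataset * 8             # 8 datasets
```

Evaluating this yields 45 pairs, 900 simulations per dataset, and 7,200 experiments overall, matching the counts stated in the text.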
Similar to state-of-the-art SIR epidemic benchmarking, our results are easy to understand and offer the possibility of direct comparison between ranking methods on the same dataset. On the other hand, we notice two improvements from applying our methodology: (1) There is much higher variation between measures on the same dataset. For example, on the FB dataset, we obtain Deg = 59.31% and Cls = 4.28%, which suggests an obvious performance difference; using SIR as the benchmark, in contrast, the coverages are ρ_Deg = 95.31% and ρ_Cls = 95.17%.
(2) There is greater emergent granularity between measures on different datasets.For example, Cls turns out to be much less efficient on a SF topology (1.99%) than on a SW topology (18.37%).
Assessing the results in Table 3, we obtain an objective comparison of state-of-the-art ranking methods used in current social network research. Figure 4 presents these cumulated performance indicators; the top three ranking methods, according to our proposed methodology, are LeaderRank (LR), HITS, and node degree (Deg).
The cumulated results in Figure 4 are based solely on the 8 datasets used throughout the paper.With more datasets used, the averaged performances will slightly differ.However, valuable insight is further offered by the visualization of performances on each dataset in turn; these results are detailed in the Supplementary Materials, Section 5.
Additionally, we provide a suggestive visual example of the opinion coverages at the end of a simulation, after balancing is attained [53] with the tolerance diffusion model [45]. The Mesh topology is exemplified here because it offers the most intuitive 2D spatial feedback after applying a force-directed layout. To this end, Figure 5 shows the coverage of competing centrality measures in three different scenarios; for example, Figure 5(c) depicts two ranking methods with low overlap and an extreme outcome: Cls (orange) 5.24% versus HI (blue) 94.76%.
The validation of our novel benchmarking methodology employs a standard strategy for the selection of multiple spreaders.After a review of the most recent advances in complex network analysis, we find that the method of simply selecting the top spreaders from the entire network is consistently found throughout literature [35,37,38,[59][60][61][62].Nevertheless, there are several alternatives for selecting multiple spreaders which we detail in the Supplementary Materials, Section 6.

Comparison between Benchmarking Methods

To highlight the superior quantitative power of our competition-based benchmark, we aggregate the results in Table 4. Here, we measure the difference Δ_min−max between the most and least efficient ranking methods and the difference Δ_1−2 between the top two ranking methods, for each dataset in turn. Seeking higher overall differences, we find that our proposed benchmarking methodology is, in general, more insightful than the classic SIR benchmark. As such, when measuring Δ_min−max, individual SIR benchmarking only manages to produce differences of ≈0.06-1.59% (1.14% on average) between ranking methods, while our proposed solution offers differences of ≈80-98% (91% on average). When trying to discern between the top 2 ranking methods on a particular dataset, SIR manages to place them apart by only ≈0-1.07% (0.31% on average), while our method produces higher differences within ≈0.28-8.75% (3.56% on average).
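The two differentiation indicators, Δ_min−max and Δ_1−2, reduce to a short helper; a sketch under our own naming assumptions, with illustrative coverage values:

```python
def differentiation(coverages):
    """Compute (Delta_min-max, Delta_1-2), in percentage points, from a
    {method: coverage%} mapping measured on one dataset."""
    vals = sorted(coverages.values(), reverse=True)
    return vals[0] - vals[-1], vals[0] - vals[1]
```

Applying the helper to one benchmark's coverage table immediately shows how far apart the best and worst (and the top two) ranking methods land.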
Another advantage of our proposed method is the overall uniformity of the performance of each centrality across the 8 selected datasets. For instance, if LR and HITS turn out to be the most efficient spreading methods on one topology, their performance is replicated with high confidence on the other topologies as well. When employing SIR benchmarking, the performances are not consistent across datasets. This aspect is suggested visually in Figure 6, where we highlight the most (LR) and least (Cls) efficient centralities, as they are ranked over the 8 datasets. It is easy to notice how LR is positioned in the top 3 and Cls in the last 2-3 methods overall. In individual SIR benchmarking, there is no such uniformity.
In conclusion, our benchmarking methodology, which is specifically designed for the competitive social network context, provides significant quantitative separation between influence ranking methods on synthetic and real social network topologies. This numerical separation is over one order of magnitude greater than the one provided by classic SIR simulation, a standard methodology used in epidemic spreading, where the diffusion context is less competitive and more ego-centred. Therefore, we encourage the use of our proposed method in specific real-world applications of dynamic social networks.

Discussion
One of the significant research challenges in network science is to rank a node's ability to spread information in a network [43]. As spreading is used to model real-world processes such as epidemic contagion and information propagation [2, 3, 20, 22, 63], our paper aims to improve the current methodology for validating and comparing state-of-the-art ranking methods in the social network context. Numerous alternative ranking methods have been developed, relying on classic graph centralities, localized targets [63], optimal percolation [43], and so on. While the challenge at hand remains partially unsolved, it is argued that insights are uncovered only through the optimal collective interplay of all the influencers in a network [43]. This emergent behaviour is also the key to our study, namely, the introduction of a benchmarking technique employing simultaneous competition-based spreading.

Table 4: Comparison between individual SIR and our simultaneous competition-based benchmark in terms of how well ranking methods are differentiated. Δ_min−max is the difference (%) between the most and least efficient ranking methods; Δ_1−2 is the difference (%) between the top 2 ranking methods on each dataset. Higher differences are better.
The main motivation of this paper is the need for increased realism in the social network context, where real-world applications imply simultaneous diffusion by their nature. Nevertheless, our methodology may be tailored to other interdisciplinary fields of science. One area of research that can benefit directly from our methodology is network biology. Specifically, determining node centrality is a hot topic in biological networks. For instance, a study shows that the phenotypic consequence of a single gene deletion is determined by the topological position in the molecular interaction network [64]; also, the relationship between the network roles of disease genes and their tolerance to germs shows that cancer driver genes occupy the most central positions [65]. Many biological studies rely on theoretical results from network science, yet they often employ only degree and betweenness centrality in their analysis. With our study, we aim to broaden the methodological perspective for interdisciplinary fields.
We find several advantages over the existing benchmarking methodology relying on the SIR epidemic model. Notably, our competition-based method offers a much greater quantitative separation between ranking methods on the same dataset (e.g., degree is roughly 14 times more effective than closeness on the Facebook dataset); we also obtain a higher granularity for a ranking method across different datasets (e.g., closeness is roughly 9 times less efficient on a scale-free topology than on a small-world topology).
Further developments of our method are possible. For instance, one can increase the number of spreaders acting simultaneously in a network from 2 to k > 2; accordingly, the alternate opinion assigning (AOA) procedure must be modified to fit the k opinion sources. A recent study discusses the importance of targeting specific localized targets rather than obtaining a high coverage of the network [63]; our method can easily be extended to measure target coverage during or at the end of a spreading simulation. Another study finds that each complex network may have a small "control set" of nodes which, when triggered, will influence the whole network [66]. These control sets are believed to be surprisingly small (5-10% of nodes) and may also be paired with our benchmarking methodology.
Finally, we consider that the topology-aggregated competition-based results we obtained (e.g., in Figure 4 of the Supplementary Materials) can be used to define a functional fingerprint of real-world networks based on how influence ranking methods perform on them. Namely, we notice that the 10 centrality measures used perform in a unique, distinguishable manner on the four fundamental synthetic topology models. This uniqueness can be quantified as a characteristic vector for random, mesh, small-world, and scale-free networks. Any real-world dataset can then be compared to other datasets through these four fingerprint vectors. Overall, we believe that our work advances a significant challenge in the study of opinion spreading phenomena and also serves as a good starting point for many of the still unsolved problems and new ideas found in the literature.

Validation Datasets.
We motivate the inclusion of synthetic datasets in the study by the need to clearly distinguish between the characteristic topological features of a network that influence spreading. These features include a normal versus power-law degree distribution, lower versus higher clustering, lower versus normal path lengths, the existence of long-range links, and hub formation. The four chosen network models represent the four fundamental topology types out of which empirical networks are further built [26,56,57].
Given our particular interest in influence spreading pertaining to the field of social network analysis, we choose four undirected (weighted and unweighted) networks consisting of various types of social relationships. As such, we rely on a weighted online social network (OSN) with 1899 users [67], an unweighted Facebook friendship network (FB) consisting of the 3172 students from a Computer Science faculty in Romania [68], an unweighted email exchange network (Emails) from London's Global University with 12,625 contacts [69], and a weighted friendship network (POK) with 28,876 users from the Slovakian POK platform [70]. On the other hand, all synthetic networks consist of 10,000 nodes and are algorithmically generated using default parameters found in the state of the art. Table 1 provides the basic statistics for each such network.

Influence Ranking Methods.
In order to define each centrality metric, we make use of the following graph theory-specific notations. A social network is a graph G = (V, E) formed of V nodes and E edges.
The edges may also be directed (i.e., e_ij ≠ e_ji) or weighted (i.e., they carry weights w_ij). The connectivity of the graph is characterized by an adjacency matrix A = (a_ij), where a_ij = 1 (or w_ij in the weighted context) if nodes v_i and v_j are connected and 0 otherwise. Furthermore, the degree of a node v_i is denoted k_i, the neighbourhood of a node is the set of nodes v_j ∈ N_i, and the average degree of G is ⟨k⟩ = 2E/V. The reviewed measures considered for benchmarking in this paper are classified into one of three categories: structure-based, location-based, and diffusion-based rankings. Among the local measures, we first mention the degree centrality (Deg) k_i of a node v_i; it is easy to compute and efficient but less relevant in some real-world scenarios [34,38], as studies show that Deg fails to identify influential nodes because it is limited to the ego network of each node [34,71].
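As a minimal illustration of the notation above, degree centrality and the average degree ⟨k⟩ = 2E/V can be computed directly from an adjacency list. The toy graph below is our own assumption for illustration, not one of the paper's datasets.

```python
# Hypothetical toy graph (triangle 1-2-3 plus pendant node 4), stored as an
# undirected adjacency list; not one of the paper's datasets.
adj = {
    1: [2, 3, 4],
    2: [1, 3],
    3: [1, 2],
    4: [1],
}

def degree_centrality(adj):
    """Degree centrality Deg(v_i) = k_i, the number of neighbours of v_i."""
    return {v: len(neigh) for v, neigh in adj.items()}

k = degree_centrality(adj)
E = sum(k.values()) // 2   # each undirected edge is counted twice
avg_k = 2 * E / len(adj)   # <k> = 2E/V
```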
The local centrality (LC) measure was introduced as a trade-off between the low-relevance degree centrality and other time-consuming measures [34]. The LC of a node v_i considers both the nearest and the next nearest neighbours and is defined as

LC(v_i) = Σ_{v_j ∈ N_i} Q(v_j), with Q(v_j) = Σ_{v_k ∈ N_j} N(v_k),

where N_i is the vicinity (set of neighbours) of node v_i, N(v_k) is the number of the nearest and the next nearest neighbours of node v_k, and Q(v_j) is the sum of N(v_k) over each node in N_j. LC can be considered more effective than degree centrality because it uses more information from the vicinity up to distance 2, yet it has a much lower computational complexity than the betweenness and closeness centralities.
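The LC definition above translates almost directly into code. This is our own sketch on an assumed toy path graph; the helper names `n_two_hop` and `local_centrality` are illustrative, not the authors' implementation.

```python
# Hypothetical toy path graph 1-2-3-4-5 (undirected), used only for illustration.
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

def n_two_hop(adj, v):
    """N(v): number of nearest and next nearest neighbours of v."""
    reach = set(adj[v])
    for u in adj[v]:
        reach.update(adj[u])
    reach.discard(v)
    return len(reach)

def local_centrality(adj, v):
    """LC(v) = sum over neighbours u in N_v of Q(u),
    where Q(u) sums N(w) over w in N_u."""
    def q(u):
        return sum(n_two_hop(adj, w) for w in adj[u])
    return sum(q(u) for u in adj[v])
```

For the middle node of the path, LC(3) = Q(2) + Q(4), which expands into sums of two-hop neighbourhood sizes of the path's nodes.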
Another local ranking measure is ClusterRank (CR), proposed by Chen et al. [35]. CR quantifies the influence of a node v_i by taking into account not only its direct influence (out-degree k_i^out) and the influence of its neighbours (as in the case of PageRank) but also its clustering coefficient c_i [56]. Formally, the ClusterRank score CR(v_i) of a node v_i is defined as

CR(v_i) = f(c_i) Σ_{v_j ∈ N_i} (k_j^out + 1),

where the term f(c_i) represents the effect of v_i's local clustering, the term +1 results from the contribution of v_j itself, and N_i is the vicinity of node v_i. Based on empirical analysis [35], the authors propose the exponential function f(c_i) = 10^{−c_i}. The local centrality with a coefficient, denoted CLC by Zhao et al. [71], is a combination of the previous CR and LC methods. The number of neighbouring nodes is measured to identify cluster centres and is combined with a decreasing function f of the local clustering coefficient of nodes, called the coefficient of local centrality c_{v_i}, namely, f(c_{v_i}) = e^{−c_{v_i}}. Mathematically, the influence of node v_i is measured as

CLC(v_i) = e^{−c_{v_i}} · LC(v_i).

Considering the global information of the graph can give better insights, so we also adopt the widely used betweenness (Btw) and closeness (Cls) centralities [56]. The betweenness of a node v_i is expressed as the fraction of shortest paths between node pairs that pass through v_i and is defined as [26]

Btw(v_i) = Σ_{j ≠ i ≠ k} σ_jk(v_i) / σ_jk,

where σ_jk is the number of shortest paths between nodes v_j and v_k, and σ_jk(v_i) denotes the number of shortest paths between v_j and v_k which pass through node v_i. The closeness centrality of a node v_i is defined as the inverse of the sum of distances to all other nodes in G; it can be considered a measure of how long it will take to spread information from a given node to the other reachable nodes in the network [56]:

Cls(v_i) = 1 / Σ_{v_j ≠ v_i} d_ij.

Location-Based Measures.

Location-based measures also require the structural information of the graph but build on the belief that the location of a node in a network is more relevant than its immediate neighbourhood. Driven by the limitations of simple graph metrics, such as degree centrality, Kitsak et al. propose k-core decomposition to quantify a node's influence, based on the assumption that nodes in the same shell have similar influence and that nodes in higher-level shells are likely to infect more nodes [28]. To this end, the k-core decomposition method was validated by several studies [28,29]. While this method is often found in literature under both the names k-core and k-shell decomposition, the two concepts differ: the k-core of a graph is the maximal subgraph in which every vertex has degree at least k, whereas a k-shell (KS) is the set of vertices that belong to the k-core but not to the (k+1)-core.
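The k-core/k-shell distinction can be made concrete with the standard iterative peeling procedure. The following is a minimal sketch on an assumed toy graph, not the paper's implementation.

```python
# Hypothetical toy graph: triangle 1-2-3 (its 2-core) plus pendant node 4.
adj = {1: [2, 3, 4], 2: [1, 3], 3: [1, 2], 4: [1]}

def k_shell(adj):
    """Assign each node its k-shell index by iteratively peeling off nodes
    of remaining degree <= k, for k = 0, 1, 2, ..."""
    deg = {v: len(n) for v, n in adj.items()}
    alive = set(adj)
    shell, k = {}, 0
    while alive:
        pruned = True
        while pruned:  # removing a node can drag its neighbours into shell k
            pruned = False
            for v in [v for v in alive if deg[v] <= k]:
                shell[v] = k
                alive.remove(v)
                for u in adj[v]:
                    if u in alive:
                        deg[u] -= 1
                pruned = True
        k += 1
    return shell

shells = k_shell(adj)
```

On this toy graph the pendant node falls into the 1-shell, while the triangle nodes form the 2-core and hence receive k-shell index 2.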
Experiments show that, when running a diffusion process (e.g., SIR) on a network, nodes with the same k_s value can still produce different numbers of infected nodes, that is, different spreading influence [32]. This phenomenon suggests that the k-core decomposition method alone is not appropriate for ranking the global spreading influence in a network. Liu et al. [32] propose to solve this observed drawback by also taking into account the shortest distance between a target node and the node set with the highest k-core value. In terms of the distance from a target node v_i to the network core G_c, the spreading influences of nodes with the same k-core value can be distinguished using the following equation:

θ(v_i | k_s^max) = (k_s^max − k_s(v_i) + 1) Σ_{v_j ∈ G_c} d_ij,

where k_s^max is the largest k-core value of G, d_ij is the shortest distance from node v_i to node v_j ∈ G_c, G_c is the network core, and G_{k_s} is the node set whose k-core values equal k_s.
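Our reading of this distance-to-core refinement can be sketched as below. The functional form of θ is reconstructed from the description above, so treat it as an assumption; the toy graph and its precomputed shell indices are likewise illustrative.

```python
from collections import deque

# Hypothetical toy graph (triangle 1-2-3 plus pendant node 4) and its k-shell
# indices; the core G_c is the set of nodes with the maximal k_s value.
adj = {1: [2, 3, 4], 2: [1, 3], 3: [1, 2], 4: [1]}
shell = {1: 2, 2: 2, 3: 2, 4: 1}

def bfs_distances(adj, src):
    """Shortest path lengths from src (unweighted breadth-first search)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist

def theta(adj, shell, v):
    """Assumed form: theta(v) = (k_s^max - k_s(v) + 1) * sum of shortest
    distances from v to the core; smaller values indicate nodes closer to
    (and higher inside) the core."""
    k_max = max(shell.values())
    core = [u for u, s in shell.items() if s == k_max]
    dist = bfs_distances(adj, v)
    return (k_max - shell[v] + 1) * sum(dist[u] for u in core)
```

Here the core node 1 obtains θ = 2, while the peripheral node 4 obtains θ = 10, so nodes sharing a k_s value can now be told apart by their distance to the core.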
In this paper, we also make use of the Hirsch index. The h-index (HI) [72] is a hybrid location-local centrality for which every node needs only a few pieces of information: the degrees of its neighbours. It was originally developed as a means to measure the scientific impact of scholars, but it now finds uses in quantifying the influence of users in social networks or of drugs in pharmacological interaction maps. The h-index of a node v_i is defined as the largest value h such that v_i has at least h neighbours with degree ≥ h.
The algorithm is intuitive to apply: for a node v_i with vicinity N_i, we order all its neighbours v_j ∈ N_i in descending order of their degrees k_{v_j}. The h-index HI(v_i) is the last position h in the ordered list at which the degree at that position is still at least as large as the position itself. For example, given the list of degrees L_{v_i} = (10, 8, 7, 6, 3, 1, 1), we deduce HI(v_i) = 4, because L_{v_i}(4) = 6 ≥ 4, but L_{v_i}(5) = 3 < 5.
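The worked example above translates directly into code. This is our own sketch; for simplicity, the input is the list of neighbour degrees rather than the graph itself.

```python
def h_index(neigh_degrees):
    """Largest h such that at least h of the given neighbour degrees are >= h."""
    degs = sorted(neigh_degrees, reverse=True)
    h = 0
    for pos, d in enumerate(degs, start=1):  # positions are 1-indexed
        if d >= pos:
            h = pos
        else:
            break
    return h
```

For the paper's example list (10, 8, 7, 6, 3, 1, 1) this returns 4, matching the derivation above.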

Diffusion-Based Measures.
Diffusion-based measures rely on obtaining a state of balance in the network after applying a nondeterministic spreading process, such as a random walk. We make use of the fundamental eigenvector centrality (EC), which assumes that the influence of a node is determined not only by the number of its neighbours (i.e., degree centrality) but also by the influence of each neighbour [73]. Inspired by EC, there are three additional algorithms we discuss in this paper.
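A standard power-iteration sketch of eigenvector centrality follows; the toy graph is assumed, and normalizing by the maximum entry is one common convention.

```python
# Hypothetical toy graph: triangle 1-2-3 plus pendant node 4.
adj = {1: [2, 3, 4], 2: [1, 3], 3: [1, 2], 4: [1]}

def eigenvector_centrality(adj, iters=200):
    """Power iteration on the adjacency structure: each node's score is the
    (normalized) sum of its neighbours' scores from the previous step."""
    x = {v: 1.0 for v in adj}
    for _ in range(iters):
        x_new = {v: sum(x[u] for u in adj[v]) for v in adj}
        norm = max(x_new.values())
        x = {v: s / norm for v, s in x_new.items()}
    return x

ec = eigenvector_centrality(adj)
```

One design caveat: plain power iteration oscillates on bipartite graphs (the adjacency spectrum is symmetric); the toy graph above contains a triangle, so the iteration converges.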
PageRank (PR) was first implemented as a random walk on the network of hyperlinks between web pages [74]. A damping factor d is introduced such that d is the probability for the user to continue browsing through hyperlinks, and 1 − d is the probability to jump to a random website. The influence s_t(v_i) of a node v_i at time t is given by

s_t(v_i) = (1 − d)/V + d Σ_{v_j} a_ji s_{t−1}(v_j)/k_j^out,

where V is the number of nodes in G and k_j^out is the out-degree of node v_j. We use the customary d = 0.85, although d may require step-wise optimization based on the network.
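A sketch of the PR iteration on an undirected toy graph follows; since each undirected edge acts as a link in both directions, k_j^out equals the degree k_j, and the graph and iteration count are our own assumptions.

```python
# Hypothetical toy graph: triangle 1-2-3 plus pendant node 4 (undirected,
# so every edge acts as a link in both directions).
adj = {1: [2, 3, 4], 2: [1, 3], 3: [1, 2], 4: [1]}

def pagerank(adj, d=0.85, iters=100):
    """s(v) = (1 - d)/V + d * sum over in-neighbours u of s(u)/k_u^out.
    For an undirected graph, the in-neighbours of v are simply adj[v]."""
    n = len(adj)
    s = {v: 1.0 / n for v in adj}
    for _ in range(iters):
        s = {
            v: (1 - d) / n + d * sum(s[u] / len(adj[u]) for u in adj[v])
            for v in adj
        }
    return s

pr = pagerank(adj)
```

With no dangling nodes, the total score is conserved at 1.0 across iterations, and better-connected nodes accumulate larger shares.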
HITS is similar to PR and is based on the concept that good hub nodes point to good authority nodes, and good authorities are pointed to by good hubs [75]. The hub scores of all nodes at time t = 0 are initialized to 1; the authority score Aut_t(v_i), at any moment in time t, is expressed as

Aut_t(v_i) = Σ_{v_j} a_ji Hub_{t−1}(v_j), with Hub_t(v_i) = Σ_{v_j} a_ij Aut_t(v_j).

Finally, the LeaderRank (LR) algorithm represents an improvement over PR, since the probability parameter is adaptive, leading to a parameter-free algorithm directly applicable to any type of complex network [37]. The method is applied by adding an additional ground node v_g that is connected to all other nodes, ensuring the graph is connected. A random walk then adds a score of +1 to each visited node v_i. The ground node starts with s_g(0) = 0, and all other nodes in G start with s_i(0) = 1. Using the notation s_t(v_i) for the score of node v_i at time t, the evolving score can be expressed as

s_t(v_i) = Σ_{v_j} a_ji s_{t−1}(v_j)/k_j^out.

The score s_t(v_i) is proven to converge towards a steady state at time t_c [37]; the score of the ground node is then evenly distributed to all other V nodes in G to conserve the scores on the nodes of interest. The final, stable LR score is expressed as

LR(v_i) = s_{t_c}(v_i) + s_{t_c}(v_g)/V.

[…] as spreaders, as determined by the degree, closeness, betweenness, and PageRank centralities, respectively.

Table 1: synthetic dataset (i.e., random, mesh, small-world, and scale-free) benchmark results for pair-wise competition between centrality measures. Each cell (x, y) contains the final opinion coverage (0-100%) for centrality x; the symmetric cell (y, x) represents the same number on a colour gradient from blue (0%) through white (50%) to orange (100%).

Table 2: real-world dataset benchmark results for pair-wise competition between centrality measures. Each cell (x, y) contains the final opinion coverage (0-100%) for centrality x; the symmetric cell (y, x) represents the same number on a colour gradient from blue (0%) through white (50%) to orange (100%).

Figure 3: performance of each ranking method (i.e., coverage 0-100%) on the 8 datasets using individual SIR benchmarking.

Figure 4: performance of each ranking method (i.e., coverage 0-100%) on the 8 datasets using simultaneous competition-based benchmarking.
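Returning to the LeaderRank procedure defined above, a minimal sketch follows. The toy graph is assumed; the ground node, score initialization, and final even redistribution follow the description in the text.

```python
# Hypothetical toy graph: triangle 1-2-3 plus pendant node 4 (undirected).
adj = {1: [2, 3, 4], 2: [1, 3], 3: [1, 2], 4: [1]}

def leader_rank(adj, iters=500):
    """Add a ground node linked to every node, run the random-walk update
    s(v) = sum over neighbours u of s(u)/k_u until (approximately) steady
    state, then split the ground node's score evenly over all nodes."""
    g = "_ground"
    full = {v: list(neigh) + [g] for v, neigh in adj.items()}
    full[g] = list(adj)  # ground node links to everyone, in both directions
    s = {v: 1.0 for v in adj}
    s[g] = 0.0
    for _ in range(iters):
        s = {v: sum(s[u] / len(full[u]) for u in full[v]) for v in full}
    share = s.pop(g) / len(adj)  # redistribute the ground node's score
    return {v: val + share for v, val in s.items()}

lr = leader_rank(adj)
```

On an undirected graph the steady-state walk score is proportional to the (ground-augmented) degree, so on this toy graph the hub node 1 converges to 1.25 and the pendant node 4 to 0.75, with the total score conserved at V = 4.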

Figure 1: Example of the limitations incurred when benchmarking a diffusion process from a single opinion's point of view only, when the real-world context implies simultaneous diffusion and competition between multiple opinions. It is suggested that the orange diffusion time T_1 is shorter (better) than the blue time T_2, due to the higher, more uniform dispersion of the orange opinion sources. However, in reality (c), neither of the two opinions may fully cover the network in the optimal times T_1 or T_2, nor will they achieve coverages as high as C_1 or C_2; that is, T_1 ≈ T_2 < T_3 and C_1 ≈ C_2 > C_3o, C_3b.

Figure 2: Example of the alternate opinion assigning approach, which offers both competing ranking methods even chances of propagation. The coloured nodes marked with indices 1-5 represent the top 5 orange, respectively blue, spreaders, as determined by the two ranking methods. Some of these spreaders overlap ((a) e.g., 3/5 means the 3rd best orange spreader and the 5th best blue spreader), so we assign each spreader node one of the two opinions (orange/blue) alternately, starting with orange first (b) and then with blue first (c). As such, a simulation of orange versus blue ranking methods translates into two independent simulations, slightly favouring each method in turn. The opinions are always assigned evenly in terms of number of nodes, for example, 3 spreaders each in this example.

Figure 3: Ratio of nodes overlapping in the top 10% (N = 10K nodes) of spreader assignments made by the 10 centrality metrics (degree, closeness, betweenness, HITS, PageRank, Hirsch index, LeaderRank, k-shell, local centrality, and eigenvector centrality) in an Erdos-Renyi random network (Rand), a mesh network, a small-world (SW) network, and a scale-free (SF) network. The blue colour intensity of a cell corresponds to the strength of correlation found in the symmetric cell, that is, cell colour (i, j) ∼ cell value (j, i). A stronger blue intensity denotes a stronger correlation.

Figure 6: Visual representation of the uniformity in benchmarking influence ranking methods across different networks. We highlight the positions obtained by LR (top centrality in terms of spreading) and Cls (least effective centrality) across our 8 datasets in the context of the individual (a) and competition-based (b) benchmarks. The vertical position of a centrality corresponds to its obtained rank (1-10) after benchmarking. For example, LR is the 5th best on the random network and the 10th best on the mesh network.

Figure 5: Comparison between the naïve (a-c) and graph colouring (d-f) methods using three competitive diffusion examples on the mesh network (N = 10,000 nodes). Larger nodes represent spreader nodes. The first centrality in each figure caption corresponds to the orange opinion and the second centrality to the blue opinion.

Figure 6: Difference in spreader spacing for closeness (orange) when switching from the naïve method (a) to the graph colouring method (b).

Table 1: Graph statistics of the eight datasets detailing the number of nodes, edges, average degree (⟨k⟩), maximum degree (k_max), average path length (APL), average clustering coefficient (ACC), and network diameter (Dmt).

Table 2: Performance of ranking methods expressed as the time τ needed to infect a network (lower is better) and the final coverage ρ, expressed as a percentage of the network size (higher is better), using SIR benchmarking.

Table 3: Average performance of the 10 ranking methods on the 8 datasets. Performance is expressed as the opinion coverage (%) obtained in the one-to-one opinion diffusion competitions with every other ranking method.

Table 3: Comparison between the naïve and graph colouring methods in terms of selecting spreader nodes. Performance is expressed as a percentage (%) for each node centrality in three competitive simulation scenarios. (Supplementary Materials)