Targeted Influential Nodes Selection in Location-Aware Social Networks

Given a target area and a location-aware social network, the location-aware influence maximization problem aims to find a set of seed users such that the information spread from these users reaches the most users within the target area. We show that the problem is NP-hard and present an approximation algorithm framework, TarIM-SF, which combines a popular sampling method with a spatial filtering model that works on arbitrary polygons. For large-scale networks, we also present a coarsening strategy that further improves efficiency. We show theoretically that our approximation algorithm provides a guarantee on seed quality. An experimental study over three real-world social networks verifies the seed quality of our framework and shows that the coarsening-based algorithm provides superior efficiency.


Introduction
In recent years, social networks have become prevalent platforms for spreading product adoption, ideas, and news. Under this trend, the influence maximization (IM) problem has become popular; it seeks k users (referred to as seeds) that maximize the number of influenced users (referred to as the influence spread) in the network. Kempe et al. [1] proved that this problem is NP-hard and presented a (1 − 1/e)-approximation algorithm that greedily selects the k seed users with the maximum marginal gain in influence spread. Motivated by their work, many studies have focused on improving the influence spread and the efficiency, such as the heuristic-based algorithm PMIA [2] and the sketch-based algorithm IMM [3].
However, many real-world applications, such as location-based word-of-mouth marketing, have location-aware requirements for IM. In [4], Li et al. focused on a location-aware IM (LAIM) query, which seeks k users that maximize the expected influence in a query region. They assumed that the network and the locations of users are given beforehand, so that an index can be constructed offline, and that the target region is submitted online as a query. However, this assumption does not always hold, because the network and the IM request may be given at the same time, which is exactly the scenario discussed in traditional IM works [2,5,6]. Besides, existing LAIM works can only handle simple regions such as a rectangle or a circle. User locations, however, are usually visualized and managed via maps, whose atomic regions are not necessarily rectangles or circles; instead, they typically appear as various complex polygons. Therefore, it is meaningful to find an efficient method that addresses the LAIM problem from scratch for an arbitrary polygonal target region. Below we provide a running example to elaborate this point.
Example. A company wants to sell a new product in a city. The people in this city are the potential buyers. To promote the product to the public, we need to find several individuals who are the most influential in the network, in the hope that, through their propagation, as many people in the city as possible learn about the product and then purchase it.
In this paper, we present a novel algorithm framework, TarIM-SF (Targeted Influence Maximization with Spatial Filtering), to deal with this problem. Given a location-aware social network and a target region, we first adopt a spatial filtering model (SFM) to identify the targeted users. Then we use the sampling approach of recent IM solutions [3] to find seed nodes. To further improve the efficiency, we coarsen the social network. In all, our contributions in this work are as follows:
(i) We relax the target region in LAIM to arbitrary polygons, which is more practical in real applications.
(ii) Our model can address LAIM from scratch, without any assumption of offline processing.
(iii) To the best of our knowledge, we are the first to prove the hardness of LAIM theoretically in detail, and, for large-scale and complex networks, we propose a coarsening-based model that further improves the efficiency with guaranteed seed quality.
Experiments on the real-world datasets Gowalla, Tweets, and Weibo demonstrate that our framework generates a seed set with theoretically guaranteed quality and outperforms a series of baseline methods in terms of influence spread. Besides, the coarsening-based algorithm provides superior efficiency.
The rest of the paper is organized as follows. Section 2 lists the related studies. Section 3 defines LAIM, proves its hardness, and presents some fundamental knowledge. Section 4 discusses the proposed algorithm framework. Section 5 shows the theoretical guarantee of seed quality. Section 6 reports the experimental results and some discussion. Section 7 concludes the paper.

Related Works
Kempe et al. [1] first formulated the influence maximization problem and proved that it is NP-hard in general but can be approximated within a factor of (1 − 1/e − ε). They presented a greedy algorithm with a provable approximation guarantee to solve this problem. However, the greedy algorithm needs to perform Monte Carlo simulation [7] to attain the approximation ratio, which incurs a large time overhead. To improve efficiency and effectiveness, a large body of follow-up research has emerged, which can be divided into three types.
Simulation-based methods estimate influence accurately, with a theoretical guarantee, by simulating the diffusion process repeatedly. Leskovec et al. [8] proposed the CELF method with the lazy-forward heuristic, which was originally designed to optimize submodular functions in [9]; see also [10-14].
Heuristic-based methods avoid Monte Carlo simulation at the expense of solution quality. For example, Chen et al. [2] proposed using local directed acyclic graphs to approximate the influence regions of nodes, while [15] restricts the spread of influence to communities and [6] approximates the influence spread using linear systems.
Sketch-based methods resolve the inefficiency of Monte Carlo simulation without losing the accuracy guarantee. Borgs et al. [16] presented a nearly optimal time algorithm for IM under the IC model. This method relies on reverse simulations of the diffusion process and builds sketches to estimate the influence function efficiently. In subsequent works, techniques for bounding the number of sketches were developed [3,17-21], among which [3,18,19] are the representative ones that exhibit the highest efficiency among sketch-based methods.
Moreover, Liu et al. [22] construct a community-level influence analysis model instead of focusing on individual influence, [23] defines the outer influence of a community and aims to find the most influential communities, and [24] constructs an influence propagation model that considers the temporal interaction between users in the social network.
Recently, additional demands on the IM problem have emerged, such as considering the interests of users [25,26], geographical factors, or temporal factors. In particular, to meet location-aware requirements in IM, Li et al. [4] proposed a method for location-aware IM that seeks k users to maximize the expected influence spread in the query region. Wang et al. [27] considered the distance between two users and defined the distance-aware IM (DAIM) problem. They proposed a priority-based algorithm with a (1 − 1/e)-approximation ratio. The authors of [28] also studied the DAIM problem, considering the distance between the locations and the users. Zhu et al. [29] proposed Gaussian-based and distance-based mobility models to derive the location-aware propagation probability in LBSN. Zhou et al. [30] take users' historical mobility behaviour into account and study the IM problem under the O2O model. Li et al. [31] aim to find several seed users that maximize the geographic spanning regions (MGSR) in the query region, while Li et al. [32] assume that users have location preferences and solve the IM problem for the targeted users. Furthermore, some works focus on the spatial-temporal IM problem [33,34], which aims to find the k best trajectories to attach an advertisement to so as to maximize the number of influenced users. Besides location, the interests and topics of users are also taken into consideration in IM: [35] proposed an algorithm that returns the top-k topics related to the query of a user, and Su et al. [36] take into account not only users' interests but also their preferences for locations to find the targeted users and then seek seeds that maximize the influence on the targeted users.

Problem Definition
Definition 1 (LAIM). Given a location-aware social network G = (V, E) where each node v ∈ V is associated with a location (denoted as v.loc), a budget k, and a target region Q, location-aware influence maximization (LAIM) aims to find a set S of k seed nodes from V such that the influence spread from S reaches the largest number of nodes located in Q.
We show the hardness of the LAIM problem under the Independent Cascade (IC) model, which is one of the most popular diffusion models [1]. Before that, we first define a problem called Subset Cover, which is used in the proof below.
Definition 2 (subset cover). Given an element set U = {u_1, u_2, . . ., u_n}, a subset U_t = {u_1, u_2, . . ., u_t} of U, and a collection S = {S_1, S_2, . . ., S_m} of subsets of U, we wish to know whether there exist k subsets in S whose union is equal to U_t.
Theorem 3. The location-aware influence maximization problem is NP-hard under the IC model.
Proof. From [1], we know that influence maximization is NP-hard by reduction from Set Cover. We first show that the Subset Cover problem above is also NP-hard by reduction from the Set Cover problem, as follows.
For S and U in the Set Cover problem, we take a subset U_t = {u_1, u_2, . . ., u_t} of U and derive from S a new collection S_t of subsets; this process can be completed in polynomial time. When we find a solution A for the Set Cover problem, the corresponding subsets in S_t cover all elements of U_t; and if we find a solution of the Subset Cover problem, the Set Cover problem can also be solved. Based on this, we construct a corresponding directed graph with (m + t) nodes, as in the proof in [1]: there is a node i for each S_i in S, a node j for each element u_j in U_t, and a directed edge (i, j) with activation probability p_ij, where p_ij = 1 if u_j ∈ S_i and p_ij = 0 otherwise. The Subset Cover problem is equivalent to deciding whether this graph contains a set A of k nodes with σ(A) ≥ (k + t), where σ(A) denotes the influence spread of node set A. Indeed, if we find a set A with σ(A) ≥ (k + t), the Subset Cover problem is solved; conversely, if all k nodes corresponding to the sets in a solution of Subset Cover are activated, all t nodes corresponding to the elements of U_t will be activated.
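The graph used in the reduction can be illustrated with a minimal sketch. It assumes that seeds are chosen among the set nodes only and that every edge fires with probability 1, so the spread is deterministic; the function names `reduction_graph`, `spread`, and `has_subset_cover` are hypothetical and the brute-force search is for illustration only.

```python
from itertools import combinations


def reduction_graph(subsets, targets):
    """Map each set node i to the target elements it points to (edge probability 1)."""
    targets = set(targets)
    return {i: set(s) & targets for i, s in enumerate(subsets)}


def spread(graph, seeds):
    """Deterministic spread: the seed set nodes plus every element node they activate."""
    covered = set().union(*(graph[i] for i in seeds))
    return len(seeds) + len(covered)


def has_subset_cover(subsets, targets, k):
    """Decide Subset Cover by asking whether some k set nodes reach spread >= k + t."""
    graph = reduction_graph(subsets, targets)
    t = len(set(targets))
    return any(spread(graph, a) >= k + t for a in combinations(graph, k))
```

For example, with subsets {1, 2}, {2, 3}, {4} and target elements {1, 2, 3}, two set nodes suffice to reach spread 2 + 3, whereas no single set node reaches 1 + 3.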

Sampling Technique.
Borgs et al. [16] introduced a sampling method called RIS, which first constructs a suitable number of sketches (referred to as RR sets and collected in R) by reverse DFS from different target nodes and then selects the k users with the maximum coverage of R as the seed nodes. The process of constructing RR sets is as follows.
First, given an edge-weighted graph G = (V, E), we denote 𝒢 = (V, E, p) as the influence graph of G, where p is the propagation probability of the edges between two user nodes. We delete every edge e in G with probability (1 − p_e). After that, we randomly choose one node in V and then construct a hypergraph and get the RR set for it.
Definition 4 (RR set). Let v be a node in V. An RR set for v is generated by first sampling a graph G′ from 𝒢 and then taking the set of nodes in G′ that can reach v.
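The construction of a single RR set can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the graph is stored as a hypothetical `in_edges` dictionary mapping each node to its incoming (source, probability) pairs, and it flips the coin for each edge lazily during the reverse traversal, which yields the same distribution as deleting edges first and then collecting the nodes that can reach the root.

```python
import random


def sample_rr_set(in_edges, root, rng):
    """Return one RR set for `root` under the IC model.

    in_edges: dict node -> list of (source, probability) for edges source -> node.
    rng: a random.Random instance, passed in so runs are reproducible.
    """
    rr = {root}
    stack = [root]
    while stack:
        v = stack.pop()  # reverse depth-first traversal
        for u, p in in_edges.get(v, []):
            # Keep edge u -> v with probability p; each edge is tested at most once
            # because every node is expanded exactly once.
            if u not in rr and rng.random() < p:
                rr.add(u)
                stack.append(u)
    return rr
```

Repeating the call with roots drawn from the nodes of interest yields the collection R used in the seed selection step.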

Spatial Filtering Model.
To determine which nodes fall into the target region Q, the most intuitive way is to compare the location of each node with the boundary of Q, which is costly when Q is complex or |V| is very large. Herein, we adopt a more efficient method for this task. Our method works by comparing the convex hull of the nodes with Q and iteratively removing the nodes that fall outside Q. Finally, we end with a group of nodes whose convex hull is inside Q. In this manner, we avoid enumerating all nodes in V. Notably, for a point set T, we can use the Graham Scan method [38] to find its convex hull. Let L = {v_1.loc, . . ., v_n.loc} be the nodes' locations and (b_1, . . ., b_{n_b}) be the boundary of Q as a point sequence. Then the process of finding the target nodes in Q is shown in Algorithm 1.
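The hull-peeling idea described above can be sketched as follows. This is an illustrative sketch under two stated assumptions, not a transcription of Algorithm 1: the region is taken to be a convex polygon given in counter-clockwise order (a non-convex region would additionally need edge-intersection checks), and the hull is built with Andrew's monotone chain rather than the Graham Scan cited in the text. All function names are hypothetical.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive when o -> a -> b turns left."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns the strict hull vertices."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def inside_convex(point, polygon):
    """True if point lies in the counter-clockwise convex polygon (boundary included)."""
    n = len(polygon)
    return all(cross(polygon[i], polygon[(i + 1) % n], point) >= 0 for i in range(n))


def filter_targets(locations, polygon):
    """locations: dict node -> (x, y). Returns the set of nodes located inside polygon."""
    remaining = dict(locations)
    while remaining:
        hull = convex_hull(list(remaining.values()))
        outside = {p for p in hull if not inside_convex(p, polygon)}
        if not outside:
            break  # every hull vertex is inside, so all remaining points are inside
        remaining = {v: p for v, p in remaining.items() if p not in outside}
    return set(remaining)
```

Only hull vertices are tested against the polygon in each round; once all of them lie inside a convex region, the interior points need no individual test.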

TarIM-SF Framework
Here we describe our TarIM-SF framework in detail. In our LAIM problem, the target users change from the whole network, as in the classic IM problem, to the users in a target region. As the construction of RR sets starts from the target users, it is reasonable to construct enough RR sets R over the whole network for the users in the target region and then choose the node set S that maximizes the coverage of R (referred to as F_R(S)). Based on this idea, we first need to identify the users in the target region before constructing RR sets, which is addressed using the method proposed in Section 3.4. Moreover, to improve the efficiency of constructing RR sets, we coarsen the network using the method proposed in Section 3.3. More details are given in Algorithm 2.
For instance, in Figure 3, given a location-aware social network G, k = 1, and a query region Q (such as a triangle), we first use SFM (line 3) to identify the goal users {3, 4, 6}. Then we coarsen the whole network and get the partition P = {C_1, C_2, C_3, C_4}, where C_1 = {1}, C_2 = {4, 6, 7}, C_3 = {5}, and C_4 = {2, 3} (line 4). Next we sample the coarsened influence graph and construct enough RR sets for the goal users (line 5). In the coarsened influence graph, the probability that partition node C_4 is chosen as the root of an RR set is 1/3, since it contains one of the three goal users, and the probability for node C_2 is 2/3. As a result, we obtain a collection R of RR sets in which node C_4 has the maximum coverage, covering R_1, R_2, and R_3. Returning to the original network graph, we choose one user node in C_4 at random as the seed, such as node 2 (lines 6-10).
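The seed selection over the sampled RR sets is a maximum coverage step. A minimal greedy sketch is given below; the function name `greedy_max_coverage` is hypothetical, ties are broken by the smallest node identifier as an arbitrary choice, and the RR sets are assumed to have been rooted at users of the target region beforehand.

```python
def greedy_max_coverage(rr_sets, k):
    """Greedily pick up to k nodes that together cover the most RR sets."""
    uncovered = [set(rr) for rr in rr_sets]
    seeds = []
    for _ in range(k):
        # Count, for every node, how many still-uncovered RR sets contain it.
        counts = {}
        for rr in uncovered:
            for v in rr:
                counts[v] = counts.get(v, 0) + 1
        if not counts:
            break  # every RR set is already covered
        best = max(sorted(counts), key=counts.get)
        seeds.append(best)
        uncovered = [rr for rr in uncovered if best not in rr]
    return seeds
```

With a coarsened graph, the returned identifiers would be partition nodes, and one original user per selected partition node would then be taken as the seed.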
In our framework, the first step is to identify the users in the target region Q, whose complexity depends on the location distribution of all nodes. If all users' locations are distributed uniformly at random, the time complexity of identifying the users within the target region under our spatial filtering model is O(n^{1/3} ⋅ n_b), where n = |V| and n_b is the number of boundary points of Q. The second step is to coarsen the network, which requires O(r(|V| + |E|)) time, where r denotes the number of random subgraphs sampled from the influence graph 𝒢. Afterwards, we use the algorithm in [3] to seek a solution in the coarsened influence network, and the time of this step is spent on constructing RR sets for the target region. The complexity of this process is O(∑_{i ∈ [|R|]} w(R_i)), where w(R_i) denotes the number of edges of the i-th RR set and |R| is decided by the parameters λ′ and λ* in [3], as well as by the numbers of nodes and edges in the coarsened influence graph.

Figure 3: Step 1: A query. Step 2: Coarsening. Step 3: Sampling.
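The contraction performed by the coarsening step can be sketched as follows. The sketch assumes that the partition into blocks is already given and follows the coarsening idea of [37] as we understand it: each block becomes one weighted node, and parallel edges between two blocks are merged into one edge that fires if at least one of them fires. The merge rule q = 1 − ∏(1 − p_e) and the names `contract` and `block_of` are assumptions of this illustration rather than details stated in the text.

```python
def contract(edges, block_of):
    """Contract a partitioned influence graph.

    edges: dict (u, v) -> propagation probability p_uv.
    block_of: dict node -> block identifier.
    Returns (weight, q): block sizes and merged probabilities between blocks.
    """
    weight = {}
    for b in block_of.values():
        weight[b] = weight.get(b, 0) + 1  # block weight = number of merged nodes

    all_fail = {}  # probability that every parallel edge between two blocks fails
    for (u, v), p in edges.items():
        bu, bv = block_of[u], block_of[v]
        if bu != bv:  # edges inside a block disappear after contraction
            all_fail[(bu, bv)] = all_fail.get((bu, bv), 1.0) * (1.0 - p)

    q = {e: 1.0 - f for e, f in all_fail.items()}
    return weight, q
```

The block weights are what allow the influence in the coarsened graph to be counted as a sum of weights rather than a number of nodes.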

Effectiveness Study
We now study theoretically the seed quality of our algorithm framework, both without and with coarsening.

Seed Quality without Coarsening.
In [1], it has been proved that, under the Independent Cascade model, the resulting influence function σ(⋅) is submodular. Here we define σ_Q(⋅) as the target influence spread, which is the number of influenced nodes in the target region Q. It is easy to see that, for any sets S and T with S ⊆ T and any element v, σ_Q(S ∪ {v}) − σ_Q(S) ≥ σ_Q(T ∪ {v}) − σ_Q(T) also holds. Hence, the function σ_Q(⋅) is submodular.
Theorem 6 (see [1]). For a nonnegative, monotone submodular function f, let S_k^g be a size-k set obtained by selecting one element at a time, each time choosing the element that has the maximum marginal function value. Assuming S_k^* is the set that maximizes the value of f over all k-element sets, then f(S_k^g) ≥ (1 − 1/e) ⋅ f(S_k^*); in other words, S_k^g guarantees a (1 − 1/e)-approximation.
So if we get a seed set S_k by adopting a greedy algorithm, S_k is a (1 − 1/e)-approximate solution for the location-aware influence maximization problem. Next we show the performance guarantee when we adopt the IMM method [3] as the greedy approach.
Lemma 7 (see [16]). For any seed set S and any vertex v, the probability that a diffusion process from S can activate v equals the probability that S overlaps an RR set for v.
We generate a sizeable set R of random RR sets for the nodes in the target region Q, and for any seed set S, the fraction F_R(S) of RR sets in R covered by S is an unbiased estimator of E(σ_Q(S))/n_Q, where n_Q is the number of vertices in the target region Q. In TIM+ [17], it has been proved that the solution which covers the maximum number of RR sets provides a (1 − 1/e − ε)-approximation with at least (1 − 1/n^ℓ) probability, provided that the number of RR sets is at least λ/OPT, where OPT is the maximum expected influence of any size-k node set in G and λ is a function of k, ℓ, n, and ε. IMM seeks a tighter lower bound LB of OPT than TIM+. Next, based on the analysis of the performance guarantee in IMM, we describe the parameter settings and the performance guarantee of our framework.
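The estimator described above can be written down directly. The following minimal sketch scales the covered fraction by the number of target users; the function name `estimate_target_spread` and the parameter name `n_targets` are hypothetical, and the RR sets are assumed to have been generated from roots drawn uniformly among the users of the target region.

```python
def estimate_target_spread(rr_sets, seeds, n_targets):
    """Estimate the target influence spread as n_targets times the covered fraction."""
    if not rr_sets:
        return 0.0
    seeds = set(seeds)
    # An RR set is covered when it shares at least one node with the seed set.
    covered = sum(1 for rr in rr_sets if seeds & set(rr))
    return n_targets * covered / len(rr_sets)
```

For instance, if a seed set overlaps 2 of 4 RR sets and the target region holds 10 users, the estimate is 10 ⋅ 2/4 = 5 influenced target users.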
Let R_1, R_2, . . ., R_θ be the sequence of generated RR sets for nodes in the target region Q. Let S_k be any size-k seed set in G and x_i be a random variable that equals 0 if S_k ∩ R_i = ∅ and 1 otherwise; then, based on Lemma 7, we have
E[σ_Q(S_k)] = (n_Q/θ) ⋅ E[∑_{i=1}^{θ} x_i] = n_Q ⋅ E[F_R(S_k)]. (2)
Let S_k^* be the size-k node set with the maximum expected influence, and let OPT = E[σ_Q(S_k^*)]. From (2), we get that n_Q ⋅ F_R(S_k^*) is an unbiased estimator of OPT. By Corollary 2 in [3], in which we set p = OPT/n_Q, and Lemma 3 in [3], we have the following.
Lemma 8. Let ε_1 > 0 and δ_1 ∈ (0, 1). If θ ≥ θ_1, then n_Q ⋅ F_R(S_k^*) ≥ (1 − ε_1) ⋅ OPT holds with at least (1 − δ_1) probability.
Based on Lemmas 8 and 9, we have the following.
Theorem 10. Given any ε_1 ≤ ε and any δ_1, δ_2 ∈ (0, 1) with δ_1 + δ_2 ≤ 1/n^ℓ, if θ ≥ max{θ_1, θ_2}, then the seed set that covers the maximum number of RR sets is a (1 − 1/e − ε)-approximate solution with at least (1 − 1/n^ℓ) probability.
For the parameters in Theorem 10, we set δ_1 = δ_2 = 1/(2n^ℓ); under this setting, θ is minimized when θ_1 = θ_2 and ε_1 = ε ⋅ α/((1 − 1/e) ⋅ α + β), where α and β are the corresponding quantities of the IMM analysis [3]. In this case, θ = (2n_Q ⋅ ((1 − 1/e) ⋅ α + β)^2)/(OPT ⋅ ε^2); conversely, if we set θ = λ*/OPT, where
λ* = 2n_Q ⋅ ((1 − 1/e) ⋅ α + β)^2 ⋅ ε^{−2}, (7)
the seed set S_k^g that covers the maximum number of RR sets is a (1 − 1/e − ε)-approximation. However, as OPT is unknown in advance, we find a tight lower bound LB of OPT, as in the IMM method. For the sampling phase of IMM, Lemma 6, Lemma 7, and Lemma 8 in [3] prove that LB ≤ OPT and that LB is close to OPT. Based on that, we can also prove that LB ≤ OPT and that LB is a tight lower bound of OPT with high probability by changing n to n_Q.
Theorem 11. With at least (1 − 1/n^ℓ) probability, the sampling algorithm in our framework returns a set R of RR sets with |R| ≥ λ*/OPT, where λ* is as defined in (7).
Combining Theorem 10 and Theorem 11, our algorithm framework without coarsening obtains a solution S_k that is a (1 − 1/e − ε)-approximation with high probability.

Seed Quality for Coarsening Method.
In this part, we show that the resulting seed set S_k achieves a (1 − 1/e − ε) ⋅ ρ-approximation when the coarsening technique is adopted to improve efficiency, where the factor ρ depends on the partition used for coarsening. Based on the study in [37], for the target region Q, the target influence inf_𝒢(Q, S) is the expectation, over random subgraphs G′ sampled from the influence graph 𝒢, of r_{G′}(Q, S), where r_{G′}(Q, S) is the number (sum of weights) of vertices in Q that are reachable from S in G′, and inf_𝒢(Q, S) is the number of vertices in Q that S can activate in 𝒢, which equals σ_Q(S) in Section 5.1 and is submodular.
In the coarsened influence graph H = (W, F, w, q), we likewise have inf_H(Q, π(S)), where π maps each node to its partition node; it denotes the sum of weights of the nodes in Q that π(S) can activate in H. After coarsening the network, we define I = (V, E, p′), where p′_uv = 1 if u and v belong to the same block C_j of the partition and p′_uv = p_uv otherwise, and inf_I(Q, S) is the number of users in the target region Q that S can activate in I.
Here, we show the relationship between H and 𝒢 in terms of the influence function through I, which has the same structure as 𝒢.
Proof. For any u and v in V, "u can reach v through the edges in E with p" if and only if "π(u) can reach π(v) through the edges in F with q", since every contracted subgraph (C_j, E_j) is strongly connected. Therefore, it holds that r_{(V,E)}(Q, S) = r_{(W,F)}(Q, π(S)). Thus, inf_I(Q, S) = inf_H(Q, π(S)).
For 𝒢 and I, we can also find the relationship between them as follows.
Proof. 𝒢 and I are influence graphs with the same structure, but p_e ≤ p′_e for every edge e. So inf_𝒢(Q, S) ≤ inf_I(Q, S) for any S ⊆ V.
For any subgraph 𝒢[C_j] = (C_j, E_j) of 𝒢, C_j ⊆ V, its strongly connected reliability, denoted as rel(𝒢[C_j]), is defined as in Equation 14 in [37] and indicates the probability that 𝒢[C_j] is strongly connected. Let S* and T* be the optimal solutions of size k for 𝒢 and I, respectively. Based on Lemma 12 and Lemma 13, we can be sure that inf_I(Q, T*) ≥ inf_𝒢(Q, S*). Then, applying Lemma 14, our coarsening-based algorithm achieves a (1 − 1/e − ε) ⋅ ρ-approximate solution for 𝒢, where ρ refers to the product ∏_j rel(𝒢[C_j]) of the strongly connected reliabilities of the contracted subgraphs.

Results and Discussion
In this section, we conduct experiments on several real-world datasets to test the performance of the proposed algorithm framework. All algorithms are implemented in C++ and run on an Ubuntu 16.10 machine with a quad-core Intel Core i5-6500 CPU at 3.20 GHz and 16 GB of RAM.
In the following experiments, we use three location-aware social networks, namely, Gowalla, Tweets, and Weibo. The statistics of the datasets are listed in Table 1 (n represents the number of vertices and m represents the number of edges). By default, we use a randomly selected target region Q for all datasets, and the number of user nodes falling in Q is denoted as n_Q. We conduct our experiments on the WC model, which is widely used for information diffusion. For the weight of every edge, we set the probability of an edge (u, v) to 1/d_v, where d_v denotes the in-degree of user node v. In TarIM-SF, we set ε = 0.1.
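The edge probabilities of the WC model can be computed in one pass over the edge list. The following minimal sketch assumes a simple directed graph given as a list of (u, v) pairs without duplicate edges; the function name `wc_probabilities` is hypothetical.

```python
def wc_probabilities(edges):
    """Assign each directed edge (u, v) the probability 1 / in-degree(v)."""
    indeg = {}
    for _, v in edges:
        indeg[v] = indeg.get(v, 0) + 1
    return {(u, v): 1.0 / indeg[v] for u, v in edges}
```

Under this assignment, the incoming probabilities of every node with at least one in-neighbour sum to 1.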

Comparison with Baseline.
We evaluate the performance of our algorithm framework TarIM-SF against the Assembly method in [4] under the WC model. To estimate the performance in general, we selected three regions with fixed n_Q for each dataset (n_Q ≈ 20 for Gowalla, n_Q ≈ 120 for Tweets, and n_Q ≈ 120 for Weibo) and report the average performance, while varying k from 10 to 50 for Gowalla and from 100 to 500 for Tweets and Weibo. In Figure 4, we can see that the target influence spread of the seeds found by our framework is clearly superior to that of Assembly, especially on Tweets and Gowalla.

Effect of Coarsening.
As mentioned above, we adopt the coarsening technique to improve the efficiency on large-scale social networks. In this part, we ran a series of experiments on Weibo, a large-scale network with about a million users. To examine how the parameter r of the coarsening method affects the target influence spread of the seeds, we report the Relative Error, which measures the gap between the real influence spread of the seed set we obtain and its estimated influence spread. We set k = 300 and the target region to the whole network. Figure 5(a) shows that the relative error decreases as r grows. When r = 16, the estimated influence spread of the seeds found by our algorithm with coarsening is nearly equal to the eventual influence spread of the seeds found without coarsening. Figures 5(b) and 5(c) indicate that the running time of the coarsening approach is significantly less than that of the noncoarsening one, without loss of seed quality.

Varying the Size and Shape of Q.
We also conducted another group of experiments by varying the target region Q in terms of both size and shape. Specifically, we vary Q as a triangle, a tetragon, and a pentagon, respectively. For each shape, we vary the size at several different levels and report the performance of our algorithm (shown in Figure 6, where |Q| = n_Q). The results confirm that our algorithm works on target regions with arbitrary polygon shapes. Besides, the time spent on the first phase of our framework increases slightly as the shape of Q varies from triangle to pentagon at the same scale, because the nodes on the convex hull need to be compared with the queried polygon: the fewer the polygon edges, the fewer the comparisons and the less time required.

Conclusions
In this study, we present a novel model that addresses LAIM for a target region that can be an arbitrary polygon. Our framework uses a spatial filtering model to first determine the nodes falling into the target polygon. Afterwards, a coarsening process is conducted over the network. Then, the state-of-the-art sampling algorithm adopted in traditional IM is used to find the solution. We theoretically prove the influence spread guarantee in both the noncoarsened and the coarsened cases. An empirical study over three real-world datasets demonstrates that our framework outperforms the baseline algorithm in terms of influence spread and is efficient on large-scale networks.

Figure 1: Several hypergraphs constructed from the sampled graph G′ (a dotted line indicates a path from one node to another that passes through other nodes; a solid line represents a directed edge).

Figure 2: Two subgraphs of the influence graph 𝒢 and the coarsened influence graph H.

Figure 6: Influence spread and running time in terms of size and shape of the target region Q.