Finding Optimal Team for Multiskill Task Based on Vehicle Sensors Data

These days, with the increasingly widespread use of sensors, particularly those attached to vehicles, the collection of spatial data is becoming easier and more accurate. As a result, many related areas, such as spatial crowdsourcing, are gaining ever more attention. A typical spatial crowdsourcing scenario involves an employer publishing a task and some workers helping to accomplish it. However, most previous studies have only considered the spatial information of workers and tasks, while ignoring individual variations among workers. In this paper, we consider the Software Development Team Formation (SDTF) problem, which aims to assemble a team of workers whose abilities satisfy the requirements of the task. After showing that the problem is NP-hard, we propose three greedy algorithms and a multiple-phase algorithm to approximately solve the problem. Extensive experiments are conducted on synthetic and real datasets, and the results verify the effectiveness and efficiency of our algorithms.


Introduction
These days, with the development of sensors (especially vehicle sensors and mobile sensors) [1][2][3], it is increasingly simple to acquire spatial and temporal information [4,5].
Many studies based on vehicle sensor data have been conducted in recent years [6][7][8]. As a result, many applications now provide services based on users' real-time spatial information, and these are becoming ever more popular. Among these applications, some focus on crowdsourcing services that use spatial information. These applications usually require some workers to help an employer accomplish a task. For example, Uber (https://www.uber.com) organizes drivers and provides users with a convenient taxi service, whereas Meituan (http://www.meituan.com) provides a reliable and fast food-delivery service. This area, called spatial crowdsourcing, is attracting significant attention.
The task assignment problem is one of the fundamental concerns in spatial crowdsourcing. For example, real-time taxi-calling platforms, such as Uber and Didi Chuxing [9], always need to assign each taxi-calling task to a suitable taxi (i.e., a crowd worker). An incorrect assignment may cause taxis to be dispatched to far-away places, which results in slow response times and losses for the platform. Many studies on the task assignment problem have been published in recent years [10][11][12]. However, most of them only consider the spatial information of tasks and workers, while ignoring the individual variations among workers. In other words, different people may excel at or struggle with different tasks, and tasks may carry requirements that some workers cannot meet.
Take Figure 1 as an example. Suppose a website development task requires coders skilled in .NET, SQL, and HTML to assemble at the origin, and three coders are available (their skills are presented in Figure 1). Although coder $c_3$ is located closer to the origin than $c_1$ and $c_2$, hiring $c_3$ will not help finish the task. In other words, it is necessary to further consider individual variations among different workers and the special requirements of tasks.
As in [13,14], each worker is associated with a set of skills representing their strengths. Tasks are also associated with a set of skills representing their special requirements.
R-trees [15] are a classical index structure for multidimensional data. Derived from the B-tree, an R-tree stores its data in leaf nodes, and all leaves are located at the same level of the tree. Every internal node contains between $m$ and $M$ child entries, and every leaf node contains between $m$ and $M$ data entries, where $M$ is usually related to the size of disk pages, and $m$ is predefined such that $m \le M/2$. The tree is structured so that the bounding rectangle of a node overlaps as little as possible with those of other nodes. Using an R-tree, we can dynamically insert, update, and delete nodes, and rapidly search for all nodes located in a given rectangle.
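To make the rectangle search concrete, the following is a minimal sketch of range search over a hand-built two-level tree. It is an illustration only: the `Node` class, the `range_search` function, and the sample points are hypothetical names and data introduced here, and the sketch omits insertion, node splitting, and the bounds on entries per node.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)
Point = Tuple[float, float]


def intersects(a: Rect, b: Rect) -> bool:
    """True if the two axis-aligned rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


@dataclass
class Node:
    mbr: Rect                                              # minimum bounding rectangle
    children: List["Node"] = field(default_factory=list)   # used by internal nodes
    points: List[Point] = field(default_factory=list)      # used by leaf nodes


def range_search(node: Node, query: Rect) -> List[Point]:
    """Return all points inside `query`, skipping subtrees whose MBR misses it."""
    if not intersects(node.mbr, query):
        return []                                          # prune the whole subtree
    if not node.children:                                  # leaf: test each point
        return [p for p in node.points
                if query[0] <= p[0] <= query[2] and query[1] <= p[1] <= query[3]]
    found: List[Point] = []
    for child in node.children:
        found.extend(range_search(child, query))
    return found


# Hypothetical data: two leaves under one root.
left = Node(mbr=(0, 0, 2, 2), points=[(0, 0), (1, 2), (2, 1)])
right = Node(mbr=(5, 5, 8, 9), points=[(5, 5), (6, 9), (8, 7)])
root = Node(mbr=(0, 0, 8, 9), children=[left, right])

print(range_search(root, (1, 1, 6, 6)))  # prints [(1, 2), (2, 1), (5, 5)]
```

The pruning test on the bounding rectangle is what lets the search skip subtrees that cannot contain results.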
The objective of our problem consists of two parts. First, workers need to move to the location of the task but receive no reward for this movement. In consideration of the workers, we attempt to reduce this unpaid moving distance. Second, the employer wishes to spend the minimum amount necessary to accomplish the task. In consideration of employers, we attempt to obtain a team at the lowest cost, on condition that the skill requirement is satisfied. As the problem definition in Section 2 shows, the objective of our work contains not only the distance between the task and workers, but also the total cost.
Contributions. In summary, our contributions are as follows: (i) We propose a new Software Development Team Formation (SDTF) problem and prove that it is NP-hard. (ii) Three greedy algorithms are provided to solve the SDTF problem. (iii) We employ a multiple-phase algorithm based on R-trees. (iv) We verify the effectiveness and efficiency of the proposed algorithms through extensive experiments on synthetic and real datasets.
Compared with our previous work [16], we propose a novel multiple-phase algorithm that uses the R-tree index structure. Additional experiments are also conducted on synthetic and real datasets.
The remainder of this paper is organized as follows. In Section 2, the problem is formally defined and proved to be NP-hard. In Section 3, three greedy algorithms are provided to solve the SDTF problem. In Section 4, we propose a multiple-phase algorithm based on R-trees. Extensive experiments on synthetic and real datasets are described in Section 5. Previous work related to our problem is presented in Section 6, and the conclusions of this study are presented in Section 7.

Problem Statement
First, we introduce the two basic concepts of a task and a coder. We then formally define the Software Development Team Formation (SDTF) problem.
Definition 1 (task). A task $t$ is defined as $\langle S, l \rangle$, where $t.S$ is the set of skills that are indispensable to complete the software development task $t$, and $t.l$ is the location specified to meet and discuss task $t$, which, for example, can be described by longitude and latitude.
Similar to the definition of a task, a coder is formally defined as follows.
Definition 2 (coder). A coder $c$ is defined as $\langle S, l, p \rangle$, where $c.S$ is the set of skills mastered by coder $c$, $c.l$ is the location of coder $c$, described similarly to that of a task $t$, and $c.p$ is the price of coder $c$.
Briefly, a team of coders is feasible for a task if the coders in the team can collaboratively accomplish the task, that is, if the union of their skill sets covers the skill set $t.S$ of the task. The skill set of every $c \in C$ is listed in Table 1. Team $T$ = {
We consider a special case of the SDTF problem in which the task and coders are located in the same position, and the skill set of the task is the universal set of all skills. To reduce the weighted set cover problem to the special-case SDTF problem, we observe that each element of the universe corresponds to a skill in $t.S$, each element of a subset $S_i$ corresponds to a skill in $c_i.S$, and the weight of $S_i$ corresponds to the price $c_i.p$ of coder $c_i$. As the task and all coders are at the same location, for every team $T$, $\max_{c \in T} |t.l, c.l| = 0$, and we need only minimize $\sum_{c \in T} c.p$. Hence, there exists a solution to the weighted set cover problem if and only if there exists a solution to the special-case SDTF problem, and we can obtain an instance of the special-case SDTF problem from an instance of the weighted set cover problem in polynomial time. Therefore, the general case of the SDTF problem is NP-hard.
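The mapping used in the reduction can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names `set_cover_to_sdtf` and `is_feasible`, the tuple layout of tasks and coders, and the sample instance are all hypothetical and introduced here only for the example.

```python
from typing import FrozenSet, Iterable, List, Sequence, Tuple

Location = Tuple[float, float]
Task = Tuple[FrozenSet[str], Location]            # (required skills, location)
Coder = Tuple[FrozenSet[str], Location, float]    # (skills, location, price)


def set_cover_to_sdtf(universe: Iterable[str],
                      subsets: Sequence[Iterable[str]],
                      weights: Sequence[float]) -> Tuple[Task, List[Coder]]:
    """Map a weighted set cover instance to a special-case SDTF instance.

    Every element becomes a skill, every subset becomes a coder, and the
    weight of a subset becomes the price of that coder. All locations coincide,
    so the distance part of the objective is zero for every team.
    """
    origin = (0.0, 0.0)
    task = (frozenset(universe), origin)
    coders = [(frozenset(s), origin, w) for s, w in zip(subsets, weights)]
    return task, coders


def is_feasible(task: Task, team: Iterable[Coder]) -> bool:
    """A team is feasible if the union of its skill sets covers the task."""
    covered = set()
    for skills, _, _ in team:
        covered |= skills
    return task[0] <= covered


# Hypothetical set cover instance with three elements and three subsets.
task, coders = set_cover_to_sdtf(
    universe={"a", "b", "c"},
    subsets=[{"a", "b"}, {"c"}, {"a", "b", "c"}],
    weights=[2.0, 1.0, 4.0],
)
team = [coders[0], coders[1]]
print(is_feasible(task, team), sum(price for _, _, price in team))  # prints True 3.0
```

The conversion is a single pass over the subsets, which is consistent with the polynomial-time requirement of the reduction.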

Greedy Solutions for SDTF
In this section, we present three greedy algorithms to solve the SDTF problem. The first two algorithms greedily choose the cheapest or the nearest coder who can cover at least one uncovered skill. Because they only optimize part of the objective function, the solution is sometimes not good enough. Thus, we propose a third greedy algorithm that considers both price and distance when choosing a new coder.

Price First-SDTF Greedy Algorithm.
The idea of the first greedy algorithm is to repeatedly add the cheapest coder to the team until the team is feasible. The whole procedure of this price first-(PF-) SDTF is illustrated in Algorithm 1. We assume that there exists at least one feasible team.
Considering that skills not in the skill set of the task contribute nothing to the accomplishment of the task, the term "cheapest coder" must be treated carefully. Here, we define the Average Price on Uncovered Intersecting Skills (APUIS) to describe how a coder $c$ contributes to the price part of the objective function:

$$\mathrm{APUIS}(c) = \frac{c.p}{|c.S \cap U_t|}, \tag{1}$$

where $c.p$ is the price of coder $c$, $c.S$ is the skill set of $c$, and $U_t$ is the uncovered skill set of task $t$. APUIS describes how a coder influences the price part of the objective function if we add him/her to the final team. Choosing a coder with lower APUIS means we can satisfy the skill requirements with a lower total price. Note that when there is no intersection between the skill set of the worker and the uncovered skill set, APUIS is infinite. Because we greedily choose the worker with the lowest APUIS, we omit this special case in (1).
In line (1) of Algorithm 1, we initialize an empty team $T$. In lines (2)-(5), while $T$ is not feasible, we find a coder $c$ who can cover at least one uncovered skill of task $t$ and has the lowest $c.p / |c.S \cap U_t|$ value, add $c$ to team $T$, and update $U_t$. Ties are broken by distance first, then arbitrarily. In line (6), we return the resulting feasible team $T$.
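The loop described above can be sketched as follows. This is a minimal sketch rather than the paper's Algorithm 1: the `Coder` class, the function names, the use of Euclidean distance for tie-breaking, and the sample coders are assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Coder:
    name: str
    skills: FrozenSet[str]
    loc: Tuple[float, float]
    price: float


def apuis(coder: Coder, uncovered: Set[str]) -> float:
    """Average Price on Uncovered Intersecting Skills; infinite if no overlap."""
    shared = len(coder.skills & uncovered)
    return coder.price / shared if shared else math.inf


def pf_sdtf(task_skills: Iterable[str], task_loc: Tuple[float, float],
            coders: List[Coder]) -> List[Coder]:
    """Price-first greedy: repeatedly add the coder with the lowest APUIS."""
    team: List[Coder] = []
    uncovered = set(task_skills)
    while uncovered:
        # Lowest APUIS first; ties are broken by distance to the task.
        best = min(coders,
                   key=lambda c: (apuis(c, uncovered), math.dist(task_loc, c.loc)))
        if math.isinf(apuis(best, uncovered)):
            raise ValueError("no feasible team exists")
        team.append(best)
        uncovered -= best.skills
    return team


# Hypothetical coders; the task sits at the origin.
coders = [
    Coder("c1", frozenset({".NET", "SQL"}), (3.0, 4.0), 400.0),
    Coder("c2", frozenset({"HTML"}), (1.0, 1.0), 100.0),
    Coder("c3", frozenset({"Java"}), (0.0, 1.0), 50.0),
    Coder("c4", frozenset({".NET", "SQL", "HTML"}), (6.0, 8.0), 900.0),
]
team = pf_sdtf({".NET", "SQL", "HTML"}, (0.0, 0.0), coders)
print([c.name for c in team])  # prints ['c2', 'c1']
```

In this example, the cheapest coder overall (c3) is never chosen because none of its skills is required, which is the situation APUIS is designed to handle.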

Distance First-SDTF Greedy Algorithm.
The idea of distance first-(DF-) SDTF is to repeatedly add the nearest coder to the team until the team is feasible. The framework of DF-SDTF is similar to that of PF-SDTF. In each iteration, we find the nearest coder $c^{*}$ who can cover at least one uncovered skill of task $t$; that is,

$$c^{*} = \arg\min_{c \,:\, c.S \cap U_t \neq \emptyset} |t.l, c.l|,$$

where $U_t$ is the uncovered skill set of task $t$ and $|t.l, c.l|$ is the distance between the task and coder $c$. The whole procedure of DF-SDTF is illustrated in Algorithm 2. We assume that there exists at least one feasible team.
In line (1), we initialize an empty team $T$. In lines (2)-(5), while $T$ is not feasible, we find the nearest coder $c$ who can cover at least one uncovered skill of task $t$, add $c$ to team $T$, and update $U_t$. Ties are broken by price first, then arbitrarily. In line (6), we return the resulting feasible team $T$.
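A minimal sketch of this distance-first loop is given below. It is not the paper's Algorithm 2: the `Coder` class, the function name `df_sdtf`, the choice of Euclidean distance, and the sample coders are assumptions introduced for illustration.

```python
import math
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class Coder:
    name: str
    skills: FrozenSet[str]
    loc: Tuple[float, float]
    price: float


def df_sdtf(task_skills: Iterable[str], task_loc: Tuple[float, float],
            coders: List[Coder]) -> List[Coder]:
    """Distance-first greedy: repeatedly add the nearest useful coder."""
    team: List[Coder] = []
    uncovered = set(task_skills)
    while uncovered:
        # Only coders covering at least one uncovered skill are candidates.
        candidates = [c for c in coders if c.skills & uncovered]
        if not candidates:
            raise ValueError("no feasible team exists")
        # Nearest first; ties are broken by price.
        best = min(candidates,
                   key=lambda c: (math.dist(task_loc, c.loc), c.price))
        team.append(best)
        uncovered -= best.skills
    return team


# Hypothetical coders; the task sits at the origin.
coders = [
    Coder("c1", frozenset({".NET"}), (1.0, 0.0), 500.0),
    Coder("c2", frozenset({"SQL", "HTML"}), (0.0, 2.0), 300.0),
    Coder("c3", frozenset({".NET", "SQL", "HTML"}), (3.0, 4.0), 600.0),
]
team = df_sdtf({".NET", "SQL", "HTML"}, (0.0, 0.0), coders)
print([c.name for c in team])  # prints ['c1', 'c2']
```

Here the two nearest coders cost 800 in total, whereas the single farther coder c3 would cost 600, which illustrates why ignoring price can yield a weaker team.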

Distance Price-SDTF Greedy Algorithm.
The aforementioned two greedy algorithms are not always effective, because they only try to optimize part of the objective function. To optimize both distance and price at every iteration, we design a utility function, Utility, defined over a task $t$, the current team $T$, and a candidate coder $c$. Using this utility function, we have a third greedy algorithm, Distance Price-(DP-) SDTF. The whole procedure of DP-SDTF is illustrated in Algorithm 3. We assume that there exists at least one feasible team.
In line (1), we initialize an empty team $T$. In lines (2)-(5), while $T$ is not feasible, we find the coder $c$ who gives the highest utility and add $c$ to team $T$. Ties are broken by distance first, then arbitrarily. In line (6), we return the resulting feasible team $T$.

Multiple-Phase R-Tree Algorithm
In this section, we introduce an algorithm based on the R-tree data structure. Considering that some previous work has applied R-trees to Nearest Neighbor (NN) searching [17,18], a naïve idea is to use an R-tree to accelerate the NN search in the DF-SDTF algorithm proposed in Section 3.2. However, this simple use of R-trees only speeds up the search and does not help optimize the final cost. As our experiments will show, the DF-SDTF algorithm performs worse than the DP-SDTF algorithm proposed in Section 3.3. We therefore need an algorithm that is both efficient and effective in solving the SDTF problem.
Our algorithm derives from an intuitive observation: if we query all nodes located in the square whose centroid is at the location of the task and whose side length is $2 \cdot d$, the distance between the task and the nodes in the result set will be at most $\sqrt{2} \cdot d$. This property provides a practical tool for the distance part of our objective function. By choosing a square with suitable sides, we obtain a set of candidate coders who are close to the location of the task. The price part of the objective function can also be optimized if we employ a proper strategy to choose the next coder from the candidate coder set.
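The distance bound in this observation follows from a short calculation. In the sketch below, the coordinates $(x_t, y_t)$ for the task and $(x, y)$ for a returned node, and the symbol $d$ for half the side length of the square, are names chosen here for illustration, and Euclidean distance is assumed.

```latex
% Task at (x_t, y_t); d is half the side length of the query square.
% Any node (x, y) returned by the query satisfies
%   |x - x_t| \le d  and  |y - y_t| \le d.
\begin{align*}
\sqrt{(x - x_t)^2 + (y - y_t)^2}
  &\le \sqrt{d^2 + d^2} \\
  &= \sqrt{2}\, d .
\end{align*}
```

Equality holds only at the four corners of the square, so the bound is tight.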
Based on the above observation, we propose the Multiple-Phase R-tree (MPR) algorithm.The main idea of our algorithm is as follows.
(1) Initialize a new R-tree and insert all coders into the tree.
(2) In each phase, obtain a candidate set of coders by querying all nodes located in the square whose centroid is at the location of the task.
(3) Sort all coders in the candidate set in ascending order of APUIS. For each coder, add him/her to the final team $T$ if his/her skills can cover at least one uncovered skill in the task.
(4) If team  is not feasible, return to step (2) and use a square with longer sides.
In detail, we generate the list of side lengths by uniformly dividing the maximum distance between the coders and the task. Given a parameter $n_p$ denoting the number of phases, we first scan the whole set of coders and calculate the maximum distance between the coders and the task, maxDis. Then, we iteratively start a phase by using a square with side length $2 \cdot \mathrm{maxDis}/n_p$, $4 \cdot \mathrm{maxDis}/n_p$, ..., $2 \cdot \mathrm{maxDis}$, until we obtain a feasible team. The square of the last phase contains every coder, so a feasible team is found whenever one exists.
The pseudocode of our MPR algorithm is shown in Algorithm 4. First, we initialize the team $T$ and find the maximum distance between the coders and the task in lines (1)-(2). Then, we calculate the step size of the sides between two phases in line (3). In phase $i$ (the iteration in lines (5)-(9)), we first query all nodes located in the square whose centroid is at the location of the task and whose side length is $2 \cdot i \cdot \mathit{step}$. We then iteratively add coders with the minimum APUIS (lines (7)-(9)). As before, ties are broken by distance first, then arbitrarily. In line (10), we return the resulting feasible team $T$.
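The phases described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's Algorithm 4: a linear scan (`square_query`) stands in for the R-tree range query, so the sketch shows the selection logic but not the speed-up, and the `Coder` class, function names, Euclidean distance, and sample coders are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Coder:
    name: str
    skills: FrozenSet[str]
    loc: Tuple[float, float]
    price: float


def apuis(coder: Coder, uncovered: Set[str]) -> float:
    """Average Price on Uncovered Intersecting Skills; infinite if no overlap."""
    shared = len(coder.skills & uncovered)
    return coder.price / shared if shared else math.inf


def square_query(coders: List[Coder], center: Tuple[float, float],
                 half_side: float) -> List[Coder]:
    """Stand-in for the R-tree range query: linear scan over all coders."""
    cx, cy = center
    return [c for c in coders
            if abs(c.loc[0] - cx) <= half_side and abs(c.loc[1] - cy) <= half_side]


def mpr(task_skills: Iterable[str], task_loc: Tuple[float, float],
        coders: List[Coder], n_phases: int) -> List[Coder]:
    """Multiple-phase selection: grow the square until the team is feasible."""
    team: List[Coder] = []
    uncovered = set(task_skills)
    max_dis = max(math.dist(task_loc, c.loc) for c in coders)
    step = max_dis / n_phases
    for i in range(1, n_phases + 1):
        # The last phase uses max_dis itself to avoid floating-point shortfall.
        half_side = i * step if i < n_phases else max_dis
        candidates = square_query(coders, task_loc, half_side)
        # Ascending APUIS for the current uncovered set; ties by distance.
        candidates.sort(key=lambda c: (apuis(c, uncovered),
                                       math.dist(task_loc, c.loc)))
        for c in candidates:
            if c.skills & uncovered:      # covers at least one uncovered skill
                team.append(c)
                uncovered -= c.skills
        if not uncovered:
            return team
    raise ValueError("no feasible team exists")


# Hypothetical coders; the task sits at the origin.
coders = [
    Coder("c1", frozenset({".NET"}), (1.0, 0.0), 500.0),
    Coder("c2", frozenset({"SQL", "HTML"}), (0.0, 2.0), 300.0),
    Coder("c3", frozenset({".NET", "SQL", "HTML"}), (6.0, 8.0), 300.0),
]
skills = {".NET", "SQL", "HTML"}
print([c.name for c in mpr(skills, (0.0, 0.0), coders, n_phases=5)])  # prints ['c2', 'c1']
print([c.name for c in mpr(skills, (0.0, 0.0), coders, n_phases=1)])  # prints ['c3']
```

With five phases the first square already yields a feasible nearby team, while a single phase considers every coder at once and picks the cheaper but distant coder, which illustrates how the number of phases trades distance against price.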

Evaluation
We applied our four algorithms to synthetic and real datasets. The algorithms were implemented in C++, and the experiments were performed on a machine with a 4-core Intel i7-4710MQ 2.50 GHz CPU and 8 GB of memory.
5.1. Datasets. We use real and synthetic datasets to evaluate our algorithms. The real dataset is taken from CSTO (http://www.csto.com/) and includes 2033 active coders. In the CSTO dataset, each task is associated with a set of skills needed to complete a software development task, and each coder is associated with a set of skills and an average price that can be deduced from the history data. As few coders have associated price information (because many coders have not completed any tasks), we analyze the price distribution using only the coders that do. Except for some expensive coders, the price of a coder is uniformly distributed in the range 0-5000 and is unrelated to the number of mastered skills. As the CSTO data are not associated with location information, we generate coordinates for each coder according to a uniform distribution.
For the synthetic data, based on our observations of the real dataset, we generate the price $c.p$ of each coder $c$ following a uniform distribution. We assume that each coder has 5-25 skills, which is common in practice. The distance from each coder to the task is generated according to a uniform distribution. The statistics and configuration of the synthetic data are given in Table 2, where the default settings are marked in bold font.

Number of Phases in MPR Algorithm.
In the MPR algorithm, we introduce a new parameter representing the total number of phases, $n_p$. Before conducting experiments on the synthetic and real data, we determined an appropriate value of $n_p$ to ensure better performance of the MPR algorithm. We first generate a synthetic dataset with the default settings to examine how $n_p$ affects the performance of the MPR algorithm. The results are shown in Figure 2 for $n_p$ from 5 to 100. According to these results, we use $n_p = 45$ in all subsequent MPR experiments.

Experiments on Synthetic Datasets.
The experimental results using the synthetic data are shown in Figures 3 and 4. In this section, we measure the effectiveness and efficiency of the four algorithms and analyze how various parameters affect the results given by each algorithm.
Effectiveness of Proposed Algorithms. Figure 3 shows the effectiveness of our four algorithms. The DP-SDTF and MPR algorithms offer similar performance and outperform both DF-SDTF and PF-SDTF.
Efficiency of Proposed Algorithms. Figure 4 shows the efficiency of our four algorithms. We can observe that although DP-SDTF and MPR have similar cost results, MPR is faster than DP-SDTF. This is because we use the R-tree to prune unpromising nodes and accelerate the query. We can also observe how the restriction of skill satisfaction affects the running time of the four algorithms. Although PF-SDTF, DF-SDTF, and DP-SDTF all use a greedy strategy and have similar structures, the DF-SDTF algorithm takes more time than the PF-SDTF and DP-SDTF algorithms. This is because the DF-SDTF algorithm only considers the effect of distance. As a result, DF-SDTF needs more coders to make the team feasible, resulting in more iterations than the PF-SDTF and DP-SDTF algorithms.
Effect of $\alpha$. Figure 3(a) shows the effect of varying $\alpha$. As $\alpha$ varies from 0.1 to 0.9, the cost of DP-SDTF decreases smoothly, indicating that $\sum_{c \in T} c.p$ contributes more than $\max_{c \in T} |t.l, c.l|$. Because the DF-SDTF (PF-SDTF) algorithm only considers distance (price), when $\alpha$ is high (low), its performance is similar to that of the DP-SDTF and MPR algorithms. However, as $\alpha$ decreases (increases), the performance of DF-SDTF (PF-SDTF) becomes worse. Because the default setting of $\alpha$ is 0.5, finding a good team requires distance and price to be considered simultaneously. We can observe that the DP-SDTF and MPR algorithms perform better, with DF-SDTF and PF-SDTF costing 3 to 4 times more.

Experiments on the Real Dataset.
The experimental results using the real dataset are shown in Figure 5. Figure 5(a) shows the effects of varying $\alpha$, and Figure 5(b) shows the effects of varying $|t.S|$. Varying $\alpha$ produces a similar effect as with the synthetic dataset. When varying $|t.S|$, the costs of the four algorithms oscillate, probably because of the structure of the CSTO dataset. Unlike the experiments on synthetic data, the MPR algorithm performs worse than DP-SDTF but still outperforms DF-SDTF and PF-SDTF. This is probably because, in real datasets, different skills may make different contributions, leading to a gap between results with synthetic data and real data.
Comparison with the Exact Result. Because the SDTF problem is NP-hard, we only conduct small-size experiments to compare the output of our DP-SDTF and MPR algorithms with the exact solution. The setting is $|t.S| = 5$ and $|C| = 300$, where coders are randomly chosen from the real dataset. The experimental results are shown in Figure 6. We can observe that the performance of DP-SDTF is similar to that of the exact algorithm, but the cost of the MPR algorithm is 1.25 to 1.5 times the exact minimum cost.

Conclusion.
From the extensive experiments conducted on both real and synthetic data to validate our four algorithms, we found that the DF-SDTF (PF-SDTF) algorithm, which focuses on the distance (price) part of the objective function, performs better with larger (smaller) values of $\alpha$. The DP-SDTF algorithm gives the best performance among the four algorithms discussed here because it considers both parts of the objective function. The fourth algorithm, MPR, accelerates the query process with little increase in the cost, which makes it more applicable in practice.

Related Work
The SDTF problem tackled in this paper covers the domains of Team Formation and Spatial Crowdsourcing. On the one hand, the SDTF problem reduces to the task assignment problem if we ignore the skill constraint. On the other hand, the requirement that the skills of a team cover the skills of the task is exactly what distinguishes it from task assignment and links it to team formation. Previous work related to these two domains is introduced in the following subsections.
6.1. Team Formation. The team formation problem was first proposed in [19]. The problem seeks a team of workers such that (1) its skills satisfy the requirements of the task and (2) the overall communication cost is minimized. That paper also proves that the problem is NP-hard. The problem has been extended by associating each worker with a capacity [20], which is the maximum number of tasks that can be assigned to the worker. To solve the capacitated team formation problem, two approximation algorithms with proven guarantees were proposed. Unlike [19,20], which only include a single task, the team formation problem has been considered with multiple tasks and workers in both offline and online scenarios [21]. While the above-mentioned studies attempt to optimize the overall communication cost, the workload can be balanced among workers by treating the communication cost as a constraint [22]. As the above shows, most studies on team formation focus on skill satisfaction in communication graphs, while ignoring the influence of spatial information.

Spatial Crowdsourcing.
The problem studied in this paper is an extension of the task assignment problem in spatial crowdsourcing, known as the server-assigned task assignment problem [10,11], in which workers cannot reject the assigned tasks. Recently, task assignment in real-time spatial crowdsourcing has also been studied under the online algorithmic model [12,23]. Based on the original task assignment problem, both [24,25] study the conflict-aware task assignment problem, in which tasks may conflict with each other and thus cannot be assigned to the same worker.
In addition, the work in [26] not only considers spatiotemporal conflicts of tasks but also plans the schedule by which each worker completes tasks. Furthermore, Kazemi et al. propose the quality-based task assignment problem [27], which utilizes majority voting techniques to guarantee the quality of task assignment results [28][29][30].
Although [13,14] integrate the task assignment problem and team formation problem and propose a two-level-based framework to solve the problem, there are two main differences between [13,14] and our work: (1) there is no capacity constraint in our work, which means that there are more candidates in the search space; (2) the objective of our work considers both the distance between the task and workers and the overall cost, whereas [13,14] only attempt to minimize the overall cost.

Conclusion
With the development of sensors, particularly vehicle sensors and mobile sensors, spatial crowdsourcing is gaining ever more attention. In this paper, we propose a novel spatial crowdsourcing problem called Software Development Team Formation (SDTF). We prove that SDTF is NP-hard and design three greedy algorithms and an index-based algorithm to solve the SDTF problem. The first two greedy algorithms, DF-SDTF and PF-SDTF, only consider part of the optimization objective, and their performance is therefore below expectations. To overcome the shortcomings of these two algorithms, we design a third greedy algorithm, called DP-SDTF, which considers both parts of the optimization goal. In addition, we develop a multiple-phase algorithm based on R-trees called MPR. The MPR algorithm accelerates the query process with little increase in cost. We conduct extensive experiments to evaluate the performance of our algorithms. The results show that our DP-SDTF algorithm achieves similar performance to the exact algorithm.

Figure 4: Running time on synthetic data.