Refining Automatically Extracted Knowledge Bases Using Crowdsourcing

Machine-constructed knowledge bases often contain noisy and inaccurate facts. There exists significant work in developing automated algorithms for knowledge base refinement. Automated approaches improve the quality of knowledge bases but are far from perfect. In this paper, we leverage crowdsourcing to improve the quality of automatically extracted knowledge bases. As human labelling is costly, an important research challenge is how we can use limited human resources to maximize the quality improvement for a knowledge base. To address this problem, we first introduce a concept of semantic constraints that can be used to detect potential errors and do inference among candidate facts. Then, based on semantic constraints, we propose rank-based and graph-based algorithms for crowdsourced knowledge refining, which judiciously select the most beneficial candidate facts to conduct crowdsourcing and prune unnecessary questions. Our experiments show that our method improves the quality of knowledge bases significantly and outperforms state-of-the-art automatic methods under a reasonable crowdsourcing cost.


Introduction
There are numerous information extraction projects that use a variety of techniques to extract knowledge from large text corpora and World Wide Web [1]. Example projects include YAGO [2], DBPedia [3], NELL [4], open information extraction [5], and knowledge vault [6]. These projects provide automatically constructed knowledge bases (KBs) with massive collections of entities and facts, where each entity or fact has a confidence score. However, machine-constructed knowledge bases contain noisy and unreliable facts due to the variable quality of information and the limited accuracy of extractors. Transforming these candidate facts into useful knowledge is a formidable challenge [7].
To alleviate the amount of noise in automatically extracted facts, these projects often employ ad hoc heuristics to reason about uncertainty and contradictoriness due to the large scale of the facts. There exists significant work in developing effective algorithms to perform joint probabilistic inference over candidate facts [7,8]. Automated approaches have been improved in terms of quality but remain far from perfect. Therefore, effective methods to obtain high quality knowledge are desired. It is easy for human experts to determine whether a fact is correct or not. However, it is impossible to hire experts to correct all of them. Recently, due to the availability of Internet platforms like Amazon Mechanical Turk (MTurk), which enables the participation of human workers in a large scale, crowdsourcing has been proven to be a viable and cost-effective alternative solution. Crowdsourcing is normally used to create labelled datasets to apply machine learning algorithms and becomes an effective way to handle computer-hard tasks [9][10][11], such as sentiment analysis [12], image classification [13], and entity resolution [14]. The limitations of machine-based approaches and the availability of easily accessible crowdsourcing platforms inspire us to exploit crowdsourcing to improve the quality of automatically extracted knowledge bases.
In this paper, we study the problem of refining knowledge bases using crowdsourcing. Specifically, given a collection 2 Computational Intelligence and Neuroscience of noisy extractions (entities and their relationships) and a budget, we can obtain a set of high quality facts from these extractions via crowdsourcing. In particular, there are two subproblems to address in this study: (1) error Detection: how can we effectively detect potential erroneous candidate facts which need to be verified by the crowd? Information extraction systems are able to extract massive collections of interrelated facts. Some facts are correct, while others are clearly incorrect and contradictory. Asking humans to verify all candidate facts is generally not feasible due to the large size of extractions. Hence, one of key challenges is to determine which subset of knowledge should be presented to the crowd for verification. (2) Knowledge Inference: how can we accurately infer consistent knowledge based on crowd feedbacks? Errors introduced from the extraction process cause inconsistencies in the knowledge base, which may contain duplicate entities and violate key ontological constraints such as subsumption, mutual exclusion, inverse, and domain and range constraints.
To address these problems, we first introduce a concept of semantic constraints, which is similar to integrity constraints in data cleaning. Then we propose rank-based and graphbased algorithms to judiciously select candidate facts to conduct crowdsourcing based on semantic constraints. Our method automatically assigns the most "beneficial" task to the crowd and infers the answers of some candidate facts based on crowd feedbacks. Experiments on NELL's knowledge base show that our method can significantly improve the quality of knowledge and outperform state-of-the-art automatic methods under a reasonable crowdsourcing cost.
To summarize, we make the following contributions: (1) We propose a rank-based crowdsourced knowledge refining framework. We introduce a concept of semantic constraints and utilize it to detect potential contradictive facts. We present a score function taking both uncertainty and contradictoriness into consideration to select the most beneficial candidate facts for crowdsourcing. (2) We construct a graph based on the semantic constraints and utilize the graph to ask questions and infer answers. We judiciously select candidate facts to ask in order to minimize the number of candidate facts to conduct crowdsourcing. We propose pathbased and topological-sorting-based algorithms that ask multiple questions in parallel in each iteration. (3) We develop a probability-based method to tolerate the errors introduced by the crowd and propagated through inference rules. (4) We conduct experiments using real-world datasets on a real crowdsourcing platform. Experimental results show the effectiveness of the proposed approaches.
The rest of this paper is structured as follows. We first review related work in Section 2 and introduce basic concepts related to our work in Section 3. Then we describe our proposed approaches in Section 4. We report experimental results in Section 5 and conclude in Section 6.

Related Work
Information extraction techniques are widely applied in the construction of web-scale knowledge bases. In this paper, we use Never Ending Language Learner (NELL) [4] as a case study. NELL starts from a few "seed instances" of each category and relation and generates a knowledge base iteratively. It uses natural language processing and information extraction techniques to extract candidate facts from a large web corpus, using facts learned from the previous iteration as training examples. NELL has four subcomponents that extract candidate facts, namely, Pattern Learner, SEAL, Morphological Classified, and Rule Learner. NELL uses heuristics and ontological constraints to promote candidate facts into a knowledge base, assigning each promotion a confidence value.
Early work on cleaning a noisy knowledge base was considered by Cohen et al. [15]. They considered only a small subset of KB errors. Jiang et al. [8] proposed a method for cleaning knowledge bases at a broader scope using Markov Logic Networks (MLNs). This method performs joint probabilistic inference over candidate facts. To make inference and learning tractable, Jiang et al. surmounted these obstacles with a number of approximations and demonstrated the utility of joint reasoning in comparison to a baseline that considers each fact independently. More recently, Pujara et al. [7] improved the model of Jiang et al. by including multiple extractors and reasoning about coreferent entities. Furthermore, Pujara et al. used probabilistic soft logic (PSL) to avoid scalability limitation of MLNs. Dong et al. [6] employed supervised machine learning methods for fusing distinct information sources by combining noisy extractions from the web with prior knowledge derived from existing knowledge repositories. However, all of above methods are automated algorithms and do not leverage the power of crowdsourcing.
There also exist many research works that incorporate crowdsourcing into data and knowledge management, such as data cleaning [14,[16][17][18][19], record linkage [20,21], schema matching [22][23][24][25], and knowledge acquisition [26,27]. For example, Wang et al. [14] proposed CrowdER to solve the problem of entity resolution via crowdsourcing. Zhang et al. [16] used crowdsourcing to clean uncertain data. Chu et al. [18] proposed KATARA, a data cleaning system that utilizes the power of knowledge bases and crowdsourcing to clean tables. Demartini et al. [20] proposed ZenCrowd which uses a mixed human-machine workflow to solve the entity linking problem. Gokhale et al. [21] studied how to do hands-off crowdsourcing record linkage which requires no involvement of developers. Sarasua et al. [22] studied the problem of ontology matching using crowdsourcing. Fan et al. [23] proposed a hybrid machine-crowdsourcing system for matching web tables. Kondreddi et al. [26,27] developed HIGGINS, a framework for human intelligence games for knowledge acquisition, to expand and complement the output of automated information extraction methods. However, so far, there has been little discussion about how to use crowdsourcing to clean a noisy knowledge base with semantic constraints.
Computational Intelligence and Neuroscience 3 A sport team is a group of athletes who play a sport together competitively.
(1) Is saints a sport team?

Knowledge Bases.
We consider an automatically extracted knowledge base as a probabilistic knowledge base, which stores facts in a form of triple (subject, predicate, and object), for example, (Brussel, citycapitalofcountry, Belgium). Each fact has a confidence score, representing the probability that the corresponding information extraction system "believes" the fact is correct. We formally define an extracted knowledge base as follows. (2) = { 1 , . . . , | | } is a set of categories (or types). Each category ∈ is a subset of E. Each entity ∈ E belongs to one or more categories.
(3) = { 1 , . . . , | | } is a set of predicates. Each ∈ defines a binary relation between one or more pairs of types. For example, the predicate specifies a binary relation between and . We call the types of subject and object domain and range of the predicate, respectively.
For each ( , ) ∈ Π, is a triple ( , , ) representing a fact that relation holds between and , where , ∈ E, ∈ ; ∈ R is a weight indicating the probability that an information extraction system believes the corresponding fact is correct, that is, a confidence score. The definition of implies a concept hierarchy: for any , ∈ , is subclass of if and only if ⊆ . Typing provides semantic context for extracted entities and is commonly adopted by the information extraction systems, so we make it an integral part of the definition. We use ( , ) to denote that is an entity of category and use ( , , ) to denote that relation holds between entities and .
An automatically extracted knowledge base could be very large and noisy. For example, the knowledge vault [6] has 1.6B triples, of which 324M have a confidence of 0.7 or higher, and 271M have a confidence of 0.9 or higher. NELL so far has acquired a knowledge base with over 80M confidenceweighted facts, 2M of which have a confidence of 0.9 or higher. The overall estimated precision of NELL's promoted facts across first 66 iterations is 74%.

Crowdsourcing.
There exist a number of crowdsourcing platforms, such as MTurk and CrowdFlower. In such platforms, we can ask human "workers" to complete microtasks. For example, we may ask them to answer questions like "Is Italy a country?" Each microtask is referred as a human intelligent task (HIT). After having completed a HIT, a worker is rewarded with a certain amount of money based on the difficulty of the HIT. That is, invoking the crowd for knowledge cleaning comes with a monetary cost. In addition, a human worker may not always produce a correct answer for a HIT. To mitigate such human errors, we assign each HIT to multiple workers and then take a majority vote. However, even when majority votes are used, we may still get incorrect answers from the crowd. As a consequence, it is crucial to take human errors into account when designing a crowd-based algorithm.
Given a set of candidate facts to be sent to the crowd, we need to combine them into HITs. For each fact, the crowd needs to verify whether the fact is correct or not. We have four questions as one HIT, where each question contains a candidate fact requiring workers to verify its correctness. Box 1 shows an example of questions we generate as an HIT for MTurk. A brief description of the HIT is shown at the top. To assist workers to understand the fact, we provide a description of the related category or relation and use a human format for each fact.

Methodology
Our method takes an automatically extracted knowledge base as input and identifies a set of true facts from noisy extractions through crowdsourcing. We first introduce the concept of semantic constraints that can be used to detect potential erroneous facts and do inference among candidate facts. And then we propose a score function to measure the usefulness of candidate facts, in order to conduct crowdsourcing. In Section 4.3, we will explain how to leverage semantic constraints as inference rules to prune unnecessary questions. Finally, we will discuss our error-tolerant techniques.

Semantic Constraints.
Integrity constraints are effective tools used in data cleaning. This section introduces a similar concept called semantic constraints that can be used to clean noisy knowledge bases. These constraints can be learned from training data or derived from ontological constraints. The ontological constraints can be seen as axioms or rules in firstorder logic. For example, we can represent an ontological constraint (every ℎ is a n) with a rule, ℎ ( ) ⇒ ( ). Similarly, since a is not a , we can have a following rule, City( ) ⇒ ¬ ( ). We derive semantic constraints according to ten types of ontological relations used in NELL: subsumption among categories and relations (e.g., every bird is an animal); mutually exclusive categories and relations (e.g., no person is a location); inversion (for mirrored relations like TeamHas-Player and PlaysForTeam); the type of the domain and range of each predicate (e.g., the mayor of a city must be a person); the functionality of relations (e.g., a person has only one birth date); antisymmetric (e.g., if person writes book , then cannot write ); and antireflexive (e.g., a company cannot produce itself).
We use the following notations: and for subclass relationships; and for mutual exclusion relationships; and for domain and range relationships; Inv for inversion; for functionality, for antisymmetric, and for antireflexive. There are two types of semantic constraints according to the label transitive relation between candidate facts: contradictive relation and positive relation. The derived semantic constraints are shown as follows.
Contradictive Relation. In semantic constraints of this type, if a candidate fact is correct, we can infer that another candidate fact must be incorrect. Violations of contradictive relations indicate potential errors.
Positive Relation. In semantic constraints of this type, if a candidate fact is correct, we can infer another candidate fact is also correct.
Given semantic constraints and a set of candidate facts, we will generate a set of ground rules. A ground rule is a rule containing only candidate facts and no variables. We first instantiate a formula of semantic constraint using ontological relations and candidate facts in the knowledge base. Then, we omit the part of instantiated ontological relations since they are deemed to be true and obtain ground rules containing only candidate facts. For example, (a) in Box 2 are sample ontological relations defined in NELL. Considering candidate facts (b) in Box 2, corresponding ground rules generated according to semantics constraints are shown (c) in Box 2.
While contradictive semantic constraints can be used to detect potential erroneous facts, both positive constraints and contradictive constraints can be used to do inference among candidate facts.

Ranking Facts Based on Benefit.
In this section, we propose a rank-based method for knowledge refining. We would like to select the most beneficial candidate facts to conduct crowdsourcing under a given budget for a knowledge base. It is obvious that we prefer to choose the facts that the corresponding information extraction system is most uncertain about. In addition, the facts that violate semantic constraints the most are of high risk and important ones to the knowledge base. It is beneficial to verify them via crowdsourcing. In this paper, we will use the contradictoriness to estimate the risk and importance of candidate facts. In summary, we will first evaluate the benefit of candidate facts in terms of improving the quality of the knowledge base by taking both uncertainty and contradictoriness into consideration. Then we will rank them according to their evaluation scores and choose top facts to conduct crowdsourcing.
Uncertainty Score. A threshold is often applied to probabilistic extractions. Facts with a high confidence can be assumed to be correct, while facts with a confidence less than threshold are deemed to be most likely incorrect. The most uncertain facts are those whose probability are closest to threshold . We use ( ) to denote the machine-based probability estimation of a fact being correct. Therefore, we model the uncertainty of a fact as follows.
The information extraction systems commonly provide a confidence score for each candidate fact, that is, the weight in the knowledge base definition. We adopt the weight as the machine-based probability estimation of a fact being correct.
Information extraction systems usually use many different extraction techniques to generate candidates. For example, NELL produces separate extractions from lexical, structural, and morphological patterns. If the patterns used to extract each candidate fact are provided, this extra information can help us better estimate the probability. We can use a simple logistic regression model learned from training data to predict the probability of each candidate fact being correct [28,29]. The features are whether each pattern cooccurs with the candidate fact, and the coefficients reflect the reliability of patterns.
Contradictoriness Score. Based on contradictive semantic constraints introduced in Section 4.1, we can detect inconsistency, errors, and conflicts among candidate facts as the violations of these constraints. The more facts a fact is contradictory with, the more likely it is a potential error. We define the contradictoriness score of a fact as follows.
where ( ) is the number of violated ground rules of when appears in . The addition of 1 ensures the contradictoriness scores are greater than zero.
For example, considering a semantic constraint (rule) violates two ground rules of 1 . Hence, 1 ( ( , )) = 2. Combining the above two factors, we use the following function to rank candidate facts. describes the overall procedure for refining a knowledge base.
Given an extracted knowledge base and a set of semantic constraints (SCs), it first initializes confidences of candidate facts being correct with machine-based estimations and calculates scores using (15)- (18). Then, it selects candidate facts from the knowledge base ( ) to conduct crowdsourcing where is a budget allowed for improving the knowledge base.

Leveraging Semantic Constraints Pruning Unnecessary
Questions. In this section, we discuss how to utilize semantic constraints as inference rules to reduce the crowdsourcing cost. The rank-based method discussed above simply selects top candidate facts to conduct crowdsourcing at a time. However, by leveraging semantic constraints, we can infer the correctness of a candidate fact from other facts without acquiring the intelligence from the crowd. Thus, we can effectively use the budget for crowdsourcing. For example, if ℎ ( , ) is correct, we do not need to crowdsource the candidate fact ( , ) since it can be inferred to be correct based on the inversion constraint (12). If ( , ) is correct, we can infer any ( , ) is incorrect based on the functionality constraint (7).

Graph-Based Algorithm.
To leverage semantic constraints, we model the selected candidate facts (under a given budget) for crowdsourcing as a graph based on ground inference rules and try to infer the correctness of some candidate facts using the graph model.
Definition 2 (graph model). Given a set of candidate facts, we build a directed graph = ( , E), where each vertex in is a candidate fact, E = E ∪ E , where E represents all positive relations and E represents all contradictive relations. Given two candidate facts and , if ⇒ , there is a directed edge ∈ E from to to represent this positive relation; if ⇒ ¬ , there is a directed edge in E from to to represent this contradictive relation. Graph Coloring. Each vertex in has two possibilities: (1) the candidate fact is correct and we color it Green; (2) the candidate fact is incorrect and we color it Red. Initially, each vertex is uncolored. Our goal is to utilize the crowd to color all vertices.
A straightforward method is to take the candidate fact on each vertex as a question and ask workers to answer the question, that is, whether the candidate fact is correct. If a worker thinks that the candidate fact is correct, the worker returns Yes and No otherwise. Based on the workers' results, we get a voted answer on each vertex. If majority of workers vote Yes, we color it Green; otherwise we color it Red. Next, we interchangeably use vertex, fact, and question if the context is clear.
This method is rather expensive as there are many vertices in the graph. To address this issue, we propose an effective coloring framework to reduce the number of questions. Algorithm 2 shows the pseudocode. It first constructs a graph based on ground inference rules (line 1-2). Then it selects an uncolored vertex and asks workers to answer Yes or No for the vertex (line 4). If majority of workers vote Yes, we not only color Green, but also color all of its descendants in ( ( )) Green and color all of their children in color ℎ ( ) and ℎ ( ( )) Red; (8) color ( ℎ ( )) and ( ℎ ( ( ))) Red; (9) else (10) color and ( ) Red; (11) end if (12) end while (13) return colored ; Algorithm 2: Graph coloring. c ( ℎ ( ) and ℎ ( ( ))) Red and their ancestors in ( ( ℎ ( )) and ( ℎ ( ( )))) Red (line [6][7][8]. In other words, for ⇒ , we can infer that is also correct; for ⇒ ¬ , we can infer is incorrect. If majority of workers vote No, we not only color Red, but also color all of its ancestors in ( ( )) Red (line 10). In other words, for ⇒ , we infer that is also incorrect. If all the vertices have been colored, the algorithm terminates. Otherwise, it selects an uncolored vertex and repeats the above steps (line [3][4][5][6][7][8][9][10][11][12].
Obviously, this method can reduce the crowdsourcing cost as we can avoid asking questions for many unnecessary vertices. For example, considering the constructed graph in Figure 1, a naive method is to conduct crowdsourcing for all eight facts. However, if we first conduct crowdsourcing for 6 , as majority of workers vote Yes, we can color 6 and their descendants 7 and 8 Green and color 5 Red without conducting crowdsourcing for its descendants. Then if we continue to conduct crowdsourcing for 4 , as majority of workers vote No, we can color 4 and its ancestor 1 Red.
An important problem in the algorithm is to select the minimum number of vertices to conduct crowdsourcing, so that all vertices in the graph are colored. We will first formulate the question selection problem and then propose a pathbased algorithm and a topological-sorting-based algorithm that select multiple vertices in each iteration to solve the problem.

Optimal Vertex Selection.
As we know, we have the basic coloring strategy: if a vertex is Green, then all of its descendants in are Green but all of its children in and their ancestors in are Red; if a vertex is Red, then all of its ancestors in are Red. We will discuss how to support the case that the two conditions do not hold in Section 4.4.
Definition 3 (optimal graph coloring). Given a graph, the optimal graph coloring problem aims to select the minimum number of vertices as questions to color all the vertices using the coloring strategy. For example, in Figure 2, if we sequentially select vertices 1 , 2 , 3 , 4 , 5 , 7 , and 6 , we should ask seven questions. The optimal crowdsourced vertices are 2 , 3 , 4 , and 6 (highlighted by bold cycles), because the colors of these vertices cannot be inferred based on the colors of other vertices. Next we will study how to identify the optimal vertices. Before that, we introduce a concept for making our explanation easily.

Definition 4 (boundary vertex). A vertex is a boundary vertex if its color cannot be inferred based on other vertices' colors.
There are four cases in : (1) all of its parents have different colors with the vertex; (2) all of its children have different colors with the vertex; (3) it has no parent and its color is Green; or (4) it has no children and its color is Red. In addition, there are two cases in : (1) all of its parents are Red; (2) it has no parents (in-edge).

Computational Intelligence and Neuroscience
Input: = ( , E ∪ E ) Output: All vertices in are colored as Green or Red (1) while there exist uncolored vertices in do (2) Compute disjoint paths in using maximal matching; (3) if there exist paths with length > 1 then (4) V ← select the optimal vertex of the longest path (5) else (6) V ← select an optimal vertex according to (7) end if (8) Crowdsourcing V and color ; (9) Remove the colored vertices; (10) end while (11) return colored Algorithm 3: Question selection: SinglePath.
For example, 6 is a boundary vertex in as its parent 4 has a different color. 8 is not a boundary vertex as its parent 6 has the same color and 8 's color can be inferred based on 6 's color. 6 is also a boundary vertex in as its parent 5 is Red.
Here we use to denote all the boundary vertices in the graph and ( ) and ( ) to denote the boundary vertices in and , respectively. There are overlaps between ( ) and ( ), so = ( )∩ ( ). Ideally, all vertices in should be checked, because their colors cannot be inferred. Thus, the number of vertices checked using any algorithm should not be smaller than the number of boundary vertices in . However, since we do not know the ground truths of all vertices in the graph, we cannot identify the boundary vertices in advance. To address this problem, we propose effective algorithms to identify the boundary vertices in with a theoretical guarantee and use a greedy algorithm to identify the boundary vertices in , respectively. Meanwhile, we note that there are more boundary vertices in because of the limited influence of a vertex in than those in . Hence, we consider firstly the boundary vertices in .
Optimal Vertex Selection in . Given a path in , we can use a binary search method to select the boundary vertices. We initially crowdsource the mid-vertex on the path. Based on the result of the mid-vertex, we determine the next step. There are two situations: (1) If the mid-vertex is colored Green, its descendants' colors can be inferred but its ancestors' colors can not be inferred. Thus, we can crowdsource the next midvertex between the current vertex and the source vertex of the path. (2) If the mid-vertex is colored Red, its ancestors' colors can be inferred but its descendants' colors can not be inferred. Therefore, we can crowdsource the next mid-vertex between the current vertex and the destination vertex of the path. Iteratively, we can find all the boundary vertices. For the path with | | vertices, the number of crowdsourcing vertices is (log | |).
Optimal Vertex Selection in . We greedily select the vertices with no in-edge in and the vertices with the largest confidence value in each contradictive group (i.e., connected subgraph), since only Green vertices can be used to infer colors of its children.

Path-Based Algorithm.
We can divide the graph into a set of disjoint paths (i.e., any two paths have no common vertices). Then we can use the binary search method described above to determine the vertices for crowdsourcing. As the maximum length of a path is | |, the number of crowdsourcing vertices is ( log | |), where is the number of disjoint paths. If = 1, we need to crowdsource log | | vertices.
Finding Disjoint Paths. In order to find the disjoint paths, we transform the graph into a bipartite graph = (( 1 , 2 ), E ), where 1 = 2 = and there is an edge between V 1 ∈ 1 and V 2 ∈ 2 if there is an edge (V 1 , V 2 ) ∈ E . We find maximal matching in , which is a maximal set of edges in where any two edges do not share a common vertex in 1 and 2 . That is, for any two edges (V, V ), ( , ) in the matching, V ̸ = and V ̸ = . Obviously, any two edges in the matching sharing the same vertex in must be on the same path. Based on this idea, we utilize the maximal matching to find the disjoint paths as follows.
Let denote the maximal matchings, 1 denote the set of the first vertices in and 2 denote the set of the second vertices in . Then 2 − 2 is the set of vertices that have no in-edges, and we can take them as the first vertices of paths. For each such a vertex V, if it has an edge (V, V ), we take V as the second vertex in a path. Then we check whether V has an edge (V , V ). Iteratively, we can find the path starting at V. The paths computed using maximal matching satisfy disjoint, complete, and minimal paths [30].
Then we propose a serial path-based vertex-selection algorithm. The pseudocode is shown in Algorithm 3. It computes disjoint paths in using maximal matching and selects the optimal vertex of the longest path to conduct crowdsourcing. When there is no path with length greater than 1, it selects an optimal vertex according to .
Input: = ( , E ∪ E ) Output: All vertices in are colored as Green or Red (1) while there exist uncolored vertices in do (2) Compute disjoint paths in using maximal matching; (3) if there exist paths with length > 1 then (4) ← select an optimal vertex of each path with lengh > 1; (5) else (6) ← select optimal vertices according to ; (7) end if (8) Crowdsourcing in parallel and color ; (9) Remove the colored vertices; (10) end while (11)  The serial path-based vertex-selection algorithm can only publish a single fact to a crowdsourcing platform at a time, which is unable to crowdsource candidate facts simultaneously and results in long latency. To overcome this drawback, we extend the path-based algorithm to support parallel settings, which select multiple vertices and publish the corresponding candidate facts simultaneously to the crowdsourcing platform in each iteration. The pseudo code is shown in Algorithm 4. We first identify the disjoint paths and use the optimal vertex selection strategy discussed in Section 4.3.2 to select one vertex from each path to conduct crowdsourcing in parallel. When there is no path with length greater than 1, we select the optimal vertices according to . Based on the answers of these vertices, we color the graph. Next we remove the colored vertices and repeat the above step until all the vertices are colored.
However, the parallel algorithm may generate conflicts. For example, if is colored Green and is colored Red, then there is a conflict on where ⇒ and ⇒ , because is inferred as Green based on but is inferred as Red based on . To address this confliction, we can use majority voting to vote 's color and randomly choose one if a tie occurred.

Topological-Sorting-Based Algorithm.
Note that the maximal matching can be computed in ( | | 2 ) [30], which is too slow in practice when used for a large knowledge base. To address this issue, we perform a topological sorting on the vertices. We first identify the set of vertices with zero in-degree, denoted by 1 . Then we delete them from the graph and find another set of vertices whose in-degrees are zero, denoted by 2 . We repeat this step until all vertices are deleted. Suppose there are | | sets, 1 , 2 , . . . , | | . Obviously vertices in each have no in-edges (as their in-degrees are 0). Therefore, each can be considered as an independent set.
We design a topological-sorting-based algorithm to improve the time efficiency of the maximal matching. It first computes topological-sorted sets 1 , 2 , . . . , | | in . And then it crowdsources vertices in the middle set (| |+1)/2 in parallel. When | | ≤ 1, it selects optimal vertices according to to conduct crowdsourcing. Based on the results of these vertices, it colors the graph and removes the colored vertices and (| |+1)/2 from the set . It repeats the above step and iteratively colors all vertices. The pseudo code of the topological-sorting-based algorithm is shown in Algorithm 5.

Tolerating Errors.
There are two types of possible errors in our graph-based framework. The first type is caused by workers' errors and the second type is propagated through Input: = ( , E) Output: All vertices in are colored as Green or Red (1) while there exist uncolored vertices in do (2) Select a set of uncolored vertices to conduct crowdsourcing; (3) for each crowdsourced with an answer do (4) if ≥ 0.8 then (5) color and related vertices using coloring strategy; (6) else (7) color Blue; (8) end if (9) end for (10) end while (11) Learn a logistic regression model to the machine confidence scores from Green and Red vertices;  inference rules. For example, suppose a candidate fact is actually incorrect. However, the workers wrongly label it as correct. This error is caused by workers' errors. Consider a contradictive fact of , whose labels are correct. Our graphbased algorithms could wrongly label it as incorrect using inference rules. This error is propagated through inference rules. We will discuss how to address these errors in our framework as follows.

Confidence of Workers' Answer.
To tolerate workers' errors, we assign each candidate fact to multiple workers and aggregate their answers. There are many methods to compute the confidence of workers' answers. We use majority voting as an example and any other techniques can be integrated into our framework. Suppose each candidate fact is assigned to workers and > /2 workers vote a consensus answer (e.g., Yes) and − workers vote the other answer (e.g., No). The confidence of the voted answer is = / .
Error-Tolerant Coloring. For each crowdsourced fact, if the confidence of workers on this fact is high, for example, greater than 0.8, we use inference rules to label related candidate facts; otherwise, we label it uncertain (color the vertex in graph as Blue) and do not use it to infer the labels of other candidate facts. For the Green and the Red vertices, we take them as ground truths as their answers have large confidences. Then we utilize them to color Blue (uncertain) vertices. Specifically, we use facts in Green and Red vertices to learn a logistic regression model on the machine confidence scores (provided by the information extraction system) and predict the labels of facts in Blue vertices.
The pseudo code of our error-tolerant coloring algorithm in shown in Algorithm 6. It uses the coloring strategy only for the vertices with high-confidence answers (line 5) and utilizes the logistic regression model to color the vertices with lowconfidence answers (lines 11-12).

Experiments
In this section, we evaluate our methods and report experimental results.

Experimental Setup
Datasets. NELL [4] generates a knowledge base iteratively. In each iteration, NELL uses facts learned from the previous iteration and a corpus of web pages to generate a new set of candidate facts. NELL selectively promotes those candidates that have a high confidence from the extractors and obey ontological constraints with the existing knowledge base to build a high-precision knowledge base. We use extractions of the 165 ℎ iteration of NELL released by [7] to evaluate our method, containing over 1 M extractions, with a manual labelled test set consisting of 4546 instances and a training set consisting of 9866 instances. There are 70 K ontological relations in total. Table 1 shows the statistics of the data set. The training set can be used to calibrate the confidence scores from the original system. When training data is not available, we can adopt the confidence provided by the information extraction system as the probability.
We calculate contradictoriness scores among all candidate facts and select candidate facts from the test set for crowdsourcing. We use a threshold 0.5 for the confidence score. For crowdsourced data, a fact is treated as correct only when more than half of crowd answers are "Yes." We compare our methods with other popular methods in terms of the quality, the number of questions, and the number of iterations. To evaluate the quality, we use three metrics, that is, precision, recall, and F1. Suppose the set of correct facts is , and the set of facts that an algorithm reports as correct is . Then the precision is = | ∩ |/| |, the recall is = | ∩ |/| |, and the F-measure is 1 = 2 /( + ).
Crowdsourcing on MTurk. We use MTurk for crowdsourcing. We post all candidate facts in the test set to MTurk and record the crowd's answers in a local file . During our experiments, when a method requests to crowdsource candidate facts, we retrieve answers from instead of posting facts to MTurk. This ensures that all methods utilize the same set of crowdsourced results, for the fairness of comparisons. We take four microtasks as one HIT, where each microtask contains a candidate fact. To assist workers to understand the fact, we provide a description of each category or relation and use a human format for each entity (see Box 1 as an example). We pay $0.02 each time a worker completes an HIT and $0.01 to MTurk for publishing each HIT. We assign each HIT to five workers. We require that each worker has an approval rate greater than 95%. This setting intends to ensure that all workers provide reasonably accurate answers to the HITs.

Experimental Results.
We first compare our method with state-of-the-art methods for knowledge refining. Then we evaluate our rank function, question selection strategies, and error-tolerant techniques, respectively.

Evaluation of Our Methods.
In order to evaluate the effectiveness of our proposed techniques, we compare our methods Rank, Graph, and Graph+ (graph-based method with error-tolerant techniques) with two recent methods for cleaning automatically extracted knowledge bases, that is, MLN [8] and PSL [7], using previously reported results on the same evaluation set. We also compare with the default strategy used by the NELL [4] project to choose candidate facts to include in the knowledge base.
MLN [8]. This method defines a Markov logic network (MLN) to perform jointly probabilistic inference over candidate facts. We compare our method against the bestperforming MLN model from [8], which expresses ontological constraints, and candidate and promoted facts trough logical rules. The MLN method reports an output with a 0.5 marginal probability cutoff, which maximizes the F1 score. PSL [7]. This method uses probabilistic soft logic (PSL) to jointly reason candidate facts and identify coreferent entities, which can perform inference more efficiently. The PSL method reports results using a soft-truth threshold 0.55 to maximize F1. NELL [4]. We also compare the default strategy used by the NELL project to choose candidate facts to include in the knowledge base. We take the promoted facts as its result. Given a budget (e.g., 40% of candidate facts), our Rank method selects top candidate facts to conduct crowdsourcing at a time. For the Graph and Graph+ methods, we construct a graph with the top candidate facts and use the topological-sorting algorithm to select questions for crowdsourcing. Figure 3(a) shows a comparison of the overall performance of our Rank, Graph, and Graph+ methods. We report the results of our methods under different budgets. From Figure 3(a), we can see that, given a larger crowdsourcing budget, our method can obtain a higher performance. Graph+ and Rank achieve a similar quality. However, Graph+ asks fewer questions than Rank, as shown in Section 5.2.3. Graph+ outperforms Graph, because Graph+ can tolerate workers' errors by not coloring unconfident vertices and thus avoids enlarging the errors by a wrong colored vertex. Figure 3(b) shows a comparison of the overall performance with the state-of-the-art methods. From Figure 3(b), we can see that MLN and PSL perform well in precision or recall, respectively. Our method improves both precision and recall. Overall, our method improves significantly on F1. With a reasonable budget (above 20% test instances), our method outperforms both MLN and PSL methods in terms of F1.

Evaluation on Rank Function.
In this experiment, we evaluate our ranking function, which is a key for selecting crowdsourcing candidate facts in the rank-based method. This function, denoted as U * C, quantifies the usefulness of a candidate fact by considering both its uncertainty and its contradictoriness with other facts. We compare U * C against following baselines. (1) Method Uncertainty considers only uncertainty scores. (2) Method considers only contradictoriness scores. (3) Method Random selects candidate facts for crowdsourcing randomly. Figure 4 shows the results of F1 using different ranking functions. The U * C method achieves the highest F1. The Uncertainty method achieves the highest recall at the beginning but the speed of improvement slowing down with the increment of the budget. This is because there are Computational Intelligence and Neuroscience many false positives and negatives among candidate facts with confidences around the threshold, which have higher uncertainty scores. The error rate drops quickly when the difference between confidence and threshold increases, while considering contradictoriness can still help detect potential erroneous facts effectively. The method achieves better precision than . consistently performs the worst. which selects multiple vertices from multiple disjoint paths in each iteration. (4) TopologicalSorting: which selects multiple independent vertices based on topological sorting in each iteration. We compare them in terms of the quality, the number of questions, and the number of iterations, shown in Figure 5.

Evaluation on Question
From Figure 5(a), we can see that the four methods achieve the similar quality, because different question orders do not affect the quality based on inference rules. From Figure 5(b), we can see that the two parallel algorithms Multipath and TopologicalSorting crowdsource a few more questions than SinglePath. This is because Multipath may crowdsource vertices with ancestor-descendant relationships and TopologicalSorting may crowdsource vertices with the same descendants which can be avoided by our serial algorithm SinglePath based on the inference rules. Topological-Sorting outperforms Multipath because TopologicalSorting crowdsources independent questions in each iteration while Multipath may crowdsource dependent questions. SinglePath outperforms Random and reduces the number of questions. This is because SinglePath can effectively identify the boundary vertex using the optimal vertex search strategy. From Figure 5(c), the two parallel algorithms Multipath and TopologicalSorting significantly outperform SinglePath and Random as they crowdsource questions in parallel.
To evaluate our graph-based method (Graph) on reducing the number of questions, we conduct additional simulation experiments on the complete dataset, using NELL beliefs as ground truths and simulating workers with accuracy of 90%. Our experimental results are shown in Figure 6. Figure 6 shows that our graph-based method crowdsources fewer questions than our rank-based method (Rank). It saves even more than 30%, comparing with the rank-based method. This is because we can utilize the inference rules to prune many candidate facts that do not need to be crowdsourced. The rank-based method achieves a higher quality at the expense of crowdsourcing many more questions. Besides, the graph-based method only involves a few iterations, because it can crowdsource many questions in parallel.

Evaluation on the Error-Tolerant Solution.
In this section, we evaluate the effectiveness of our error-tolerant solution (proposed in Section 4.4) by comparing two algorithms: extends Graph to tolerate errors. We use simulated workers and conduct evaluation under different accuracy levels of crowdsourcing workers (i.e., 70%, 80%, and 90%) on test dataset. We compare Graph+ with Graph in terms of quality, the number of questions, and the number of iterations. Our experimental results are shown in Figure 7. From Figure 7, we can see that Graph+ achieves a higher quality than Graph, because it can tolerate the errors introduced by crowdsourcing workers and avoid error propagation along the inference rules. Graph+ significantly outperforms Graph for low-quality workers. With the increment of the accuracy level of workers, the improvement decreases. On the other hand, Graph+ crowdsources a little more questions than Graph. This is because Graph+ does not utilize the inference rules for some facts, so that it reduces the number of inferred facts. From Figure 7(c), we can see that the two methods have the same number of iterations. This is expected, since the only difference between Graph+ and Graph is that Graph+ does not infer the answers for some unconfident facts. The accuracy level of crowdsourcing workers has little impact on the number of questions and the number of iterations for both methods, because the number of questions and the number of iterations are determined by the graph structure. Therefore, we can use the error-tolerant technique to improve the quality of the knowledge base.

Conclusions
We proposed a cost-effective method for cleaning automatically extracted knowledge bases using crowdsourcing. Our method uses a ranking score to select the most beneficial candidate facts for crowdsourcing in terms of improving the quality of knowledge bases. We constructed a graph based on the semantic constraints and utilized the graph