
Machine-constructed knowledge bases often contain noisy and inaccurate facts. There has been significant work on automated algorithms for knowledge base refinement; these approaches improve the quality of knowledge bases but are far from perfect. In this paper, we leverage crowdsourcing to improve the quality of automatically extracted knowledge bases. Since human labeling is costly, an important research challenge is how to use limited human resources to maximize the quality improvement of a knowledge base. To address this problem, we first introduce the concept of semantic constraints, which can be used to detect potential errors and to perform inference among candidate facts. Then, based on semantic constraints, we propose rank-based and graph-based algorithms for crowdsourced knowledge refining, which judiciously select the most beneficial candidate facts for crowdsourcing and prune unnecessary questions. Our experiments show that our method significantly improves the quality of knowledge bases and outperforms state-of-the-art automatic methods at a reasonable crowdsourcing cost.

There are numerous information extraction projects that use a variety of techniques to extract knowledge from large text corpora and the World Wide Web [

Due to the large scale of the extracted facts, these projects often employ ad hoc heuristics to reason about uncertainty and contradictions in order to reduce the noise in automatically extracted facts. There has been significant work on developing effective algorithms that perform joint probabilistic inference over candidate facts [

In this paper, we study the problem of refining knowledge bases using crowdsourcing. Specifically, given a collection of noisy extractions (entities and their relationships) and a budget, we aim to obtain a set of high-quality facts from these extractions via crowdsourcing. In particular, there are two subproblems to address in this study:

To address these problems, we first introduce the concept of semantic constraints, which is similar to integrity constraints in data cleaning. We then propose rank-based and graph-based algorithms that judiciously select candidate facts for crowdsourcing based on semantic constraints. Our method automatically assigns the most “beneficial” tasks to the crowd and infers the answers of some candidate facts from crowd feedback. Experiments on NELL’s knowledge base show that our method can significantly improve the quality of the knowledge base and outperform state-of-the-art automatic methods at a reasonable crowdsourcing cost.

To summarize, we make the following contributions:

We propose a rank-based crowdsourced knowledge refining framework. We introduce a concept of semantic constraints and utilize it to detect potential contradictive facts. We present a score function taking both uncertainty and contradictoriness into consideration to select the most beneficial candidate facts for crowdsourcing.

We construct a graph based on the semantic constraints and utilize the graph to ask questions and infer answers. We judiciously select candidate facts to ask in order to minimize the number of candidate facts to conduct crowdsourcing. We propose path-based and topological-sorting-based algorithms that ask multiple questions in parallel in each iteration.

We develop a probability-based method to tolerate the errors introduced by the crowd and propagated through inference rules.

We conduct experiments using real-world datasets on a real crowdsourcing platform. Experimental results show the effectiveness of the proposed approaches.

The rest of this paper is structured as follows. We first review related work in Section

Information extraction techniques are widely applied in the construction of web-scale knowledge bases. In this paper, we use the Never-Ending Language Learner (NELL) [

Early work on cleaning noisy knowledge bases was done by Cohen et al. [

There is also a large body of work that incorporates crowdsourcing into data and knowledge management, such as data cleaning [

We consider an automatically extracted knowledge base as a probabilistic knowledge base, which stores facts as (subject, predicate, object) triples, for example, (Brussels, citycapitalofcountry, Belgium). Each fact

An extracted knowledge base (KB) is a 5-tuple

The definition of

An automatically extracted knowledge base could be very large and noisy. For example, the knowledge vault [

There exist a number of crowdsourcing platforms, such as MTurk and CrowdFlower. On such platforms, we can ask human “workers” to complete microtasks, for example, to answer questions like “Is Italy a country?” Each microtask is referred to as a human intelligence task (HIT). After completing a HIT, a worker is rewarded with a certain amount of money based on the difficulty of the HIT. That is, invoking the crowd for knowledge cleaning comes with a monetary cost. In addition, a human worker may not always produce a correct answer for a HIT. To mitigate such human errors, we assign each HIT to multiple workers and then take a majority vote. However, even with majority voting, we may still get incorrect answers from the crowd. Consequently, it is crucial to take human errors into account when designing a crowd-based algorithm.
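The majority-vote aggregation described above can be sketched as follows; the function name and the tie-breaking rule (rejecting a fact on a tie) are illustrative choices, not a prescription from any platform API.

```python
from collections import Counter

def majority_vote(answers):
    """Aggregate Yes/No answers from multiple workers for one HIT question.

    `answers` is a list of booleans (True = Yes). Returns the majority
    label; ties default to False, i.e., the fact is not accepted.
    """
    votes = Counter(answers)
    return votes[True] > votes[False]

# Example: five workers verify "Is Italy a country?"
print(majority_vote([True, True, False, True, True]))  # True
```

Assigning each HIT to an odd number of workers avoids ties altogether, which is why an odd redundancy (e.g., 3 or 5) is the common choice in practice.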

Given a set of candidate facts to be sent to the crowd, we need to combine them into HITs. For each fact, the crowd needs to verify whether the fact is correct. We group four questions into one HIT, where each question presents a candidate fact and asks workers to verify its correctness. Box

A sport team is a group of athletes who play a sport together competitively.

Our method takes an automatically extracted knowledge base as input and identifies a set of true facts from the noisy extractions through crowdsourcing. We first introduce the concept of semantic constraints, which can be used to detect potentially erroneous facts and to perform inference among candidate facts. We then propose a score function to measure the usefulness of candidate facts for crowdsourcing. In Section

Integrity constraints are effective tools in data cleaning. This section introduces a similar concept, called semantic constraints, that can be used to clean noisy knowledge bases. These constraints can be learned from training data or derived from ontological constraints. The ontological constraints can be seen as axioms or rules in first-order logic. For example, we can represent an ontological constraint (every

We derive semantic constraints from ten types of ontological relations used in NELL: subsumption among categories and relations (e.g., every bird is an animal); mutual exclusion among categories and relations (e.g., no person is a location); inversion (for mirrored relations like TeamHasPlayer and PlaysForTeam); the type of the domain and range of each predicate (e.g., the mayor of a city must be a person); the functionality of relations (e.g., a person has only one birth date); antisymmetry (e.g., if person

We use the following notations:

There are two types of semantic constraints according to the label transitive relation between candidate facts:

Given semantic constraints and a set of candidate facts, we generate a set of ground rules. A ground rule is a rule containing only candidate facts and no variables. We first instantiate a semantic-constraint formula using the ontological relations and candidate facts in the knowledge base. Then we omit the instantiated ontological relations, since they are deemed true, and obtain ground rules containing only candidate facts. For example, (a) in Box

Domain(ceoof, ceo)

Ran(ceoof, company)

Sub(ceo, person)

RSub(ceoof, topmemberoforganization)

RSub(topmemberoforganization, worksfor)

RSub(topmemberoforganization, personleadsorganization)

RSub(worksfor, personbelongstoorganization)

RMut(topmemberoforganization, organizationleadbyperson)

While contradictive semantic constraints can be used to detect potentially erroneous facts, both positive constraints and contradictive constraints can be used to perform inference among candidate facts.
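Ground rule generation for a subsumption constraint can be sketched as below. The relation names and the helper `ground_subsumption` are illustrative only; NELL's actual constraint formalism is richer (it also covers mutual exclusion, inversion, domain/range typing, and so on), but the instantiate-then-drop-ontology pattern is the same.

```python
def ground_subsumption(sub_pairs, candidate_facts):
    """Instantiate subsumption constraints Sub(c1, c2) ("every c1 is a c2")
    into ground rules over candidate category facts.

    sub_pairs: iterable of (c1, c2) ontological relations, e.g. ("ceo", "person").
    candidate_facts: set of (entity, category) candidate facts.
    Returns ground rules (antecedent, consequent): if the antecedent
    fact is correct, the consequent fact must be correct as well.  The
    ontological relation itself is dropped, as it is deemed true.
    """
    rules = []
    for c1, c2 in sub_pairs:
        for entity, category in candidate_facts:
            if category == c1:
                rules.append(((entity, c1), (entity, c2)))
    return rules

facts = {("TimCook", "ceo"), ("Brussels", "city")}
print(ground_subsumption([("ceo", "person")], facts))
# [(('TimCook', 'ceo'), ('TimCook', 'person'))]
```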

In this section, we propose a rank-based method for knowledge refining. We would like to select the most beneficial candidate facts for crowdsourcing under a given budget. Obviously, we prefer the facts that the information extraction system is most uncertain about. In addition, the facts that most violate semantic constraints are high-risk and important to the knowledge base, so it is beneficial to verify them via crowdsourcing. In this paper, we use contradictoriness to estimate the risk and importance of candidate facts. In summary, we first evaluate the benefit of each candidate fact, in terms of improving the quality of the knowledge base, by taking both uncertainty and contradictoriness into consideration. Then we rank the facts by their evaluation scores and choose the top

The information extraction systems commonly provide a confidence score for each candidate fact, that is, the weight

Information extraction systems usually use many different extraction techniques to generate candidates. For example, NELL produces separate extractions from lexical, structural, and morphological patterns. If the patterns used to extract each candidate fact are provided, this extra information can help us better estimate the probability. We can use a simple logistic regression model learned from training data to predict the probability of each candidate fact being correct [
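A minimal sketch of this probability estimate follows. The feature encoding (one binary indicator per pattern family) and the weights are hypothetical; in practice the weights would be fit by logistic regression on labeled training data, as described above.

```python
import math

def extraction_probability(pattern_features, weights, bias):
    """Estimate the probability that a candidate fact is correct from
    binary indicators of the extraction patterns that produced it,
    via a logistic regression model whose weights are assumed to have
    been learned from training data.
    """
    z = bias + sum(w * x for w, x in zip(weights, pattern_features))
    return 1.0 / (1.0 + math.exp(-z))  # logistic (sigmoid) link

# Hypothetical model over three pattern families
# (lexical, structural, morphological); the fact below was extracted
# by the lexical and morphological extractors but not the structural one.
p = extraction_probability([1, 0, 1], weights=[1.2, 0.4, 0.9], bias=-1.0)
print(round(p, 3))  # ~0.75
```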

For example, considering a semantic constraint (rule)

Combining the above two factors, we use the following function to rank candidate facts.

Based on ranking scores, we select a batch of candidate facts to conduct crowdsourcing at a time. Algorithm
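The exact score function is not reproduced here; the sketch below shows one plausible combination of the two factors, with uncertainty maximal when the estimated probability is 0.5 and `lam` a hypothetical trade-off parameter, followed by the batch top-k selection step.

```python
def benefit_score(p, contradictoriness, lam=1.0):
    """Score a candidate fact for crowdsourcing: an illustrative mix of
    (i) uncertainty, largest when the extractor's probability p is 0.5,
    and (ii) contradictoriness, the accumulated weight of the semantic
    constraints the fact participates in violating.
    """
    uncertainty = 1.0 - abs(2.0 * p - 1.0)  # 1 at p = 0.5, 0 at p in {0, 1}
    return uncertainty + lam * contradictoriness

def select_top_k(facts, k):
    """Pick the k most beneficial facts; `facts` maps fact -> (p, contr)."""
    ranked = sorted(facts, key=lambda f: benefit_score(*facts[f]), reverse=True)
    return ranked[:k]

facts = {"f1": (0.9, 0.0), "f2": (0.5, 0.2), "f3": (0.55, 1.5)}
print(select_top_k(facts, 2))  # ['f3', 'f2']
```

Note how f3 outranks f2 despite being less uncertain: its heavy constraint violations make it riskier, which is exactly the behavior the score is meant to capture.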

In this section, we discuss how to utilize semantic constraints as inference rules to reduce the crowdsourcing cost. The rank-based method discussed above simply selects the top

To leverage semantic constraints, we model the selected candidate facts (under a given budget) for crowdsourcing as a graph based on ground inference rules and try to infer the correctness of some candidate facts using the graph model.

Given a set of candidate facts, we build a directed graph

Figure

A sample of graph model.

A straightforward method is to take the candidate fact at each vertex as a question and ask workers whether the candidate fact is correct. A worker returns Yes if he or she thinks the fact is correct and No otherwise. Based on the workers’ results, we obtain a voted answer for each vertex: if a majority of workers vote Yes, we color it Green; otherwise, we color it Red. In the following, we use vertex, fact, and question interchangeably when the context is clear.

This method is rather expensive as there are many vertices in the graph. To address this issue, we propose an effective coloring framework to reduce the number of questions. Algorithm

Obviously, this method can reduce the crowdsourcing cost, as we avoid asking questions for many unnecessary vertices. For example, consider the constructed graph in Figure

An important problem in the algorithm is to select the minimum number of vertices to conduct crowdsourcing, so that all vertices in the graph are colored. We will first formulate the question selection problem and then propose a path-based algorithm and a topological-sorting-based algorithm that select multiple vertices in each iteration to solve the problem.

Our basic coloring strategy is as follows: if a vertex is Green, then all of its descendants in

Given a graph, the optimal graph coloring problem aims to select the minimum number of vertices as questions to color all the vertices using the coloring strategy.
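The coloring strategy can be sketched as a simple graph traversal. The representation below (edge lists of ground rules, a dictionary of colors) is illustrative: an edge u -> v encodes a positive ground rule "u correct implies v correct", so Green propagates forward to descendants and Red propagates backward to ancestors (by contraposition).

```python
from collections import defaultdict, deque

def propagate(edges, vertex, color, colors):
    """Apply the coloring strategy on a graph of positive ground rules.

    edges: list of (u, v) pairs meaning "u correct implies v correct".
    Coloring `vertex` Green also colors all its descendants Green;
    coloring it Red also colors all its ancestors Red.  `colors` is
    updated in place and returned.
    """
    succ, pred = defaultdict(list), defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
    step = succ if color == "Green" else pred
    queue = deque([vertex])
    while queue:
        u = queue.popleft()
        if colors.get(u) == color:
            continue  # already colored; also guards against cycles
        colors[u] = color
        queue.extend(step[u])
    return colors

# A chain f1 -> f2 -> f3: the crowd confirms f2, so f3 is inferred
# Green without ever being asked.
colors = propagate([("f1", "f2"), ("f2", "f3")], "f2", "Green", {})
print(colors)  # {'f2': 'Green', 'f3': 'Green'}
```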

For example, in Figure

A sample of boundary vertex.

A vertex is a boundary vertex if its color cannot be inferred based on other vertices’ colors. There are four cases in

For example,

Here we use

We can divide the graph

Let

Then we propose a serial path-based vertex-selection algorithm. The pseudocode is shown in Algorithm
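On a single path, the key observation is that correct facts form a suffix and incorrect facts a prefix (an edge u -> v means "u correct implies v correct"), so the boundary can be located by binary search. The sketch below illustrates this idea for one path with a consistent answer oracle; it is a simplification of the full algorithm, which must also handle vertices shared between paths.

```python
def color_path(path, oracle):
    """Color one path of the rule graph with few questions.

    `path` lists vertices from ancestors to descendants along positive
    rules, so colors form a Red-prefix / Green-suffix pattern and binary
    search finds the boundary.  `oracle(v)` is the crowd's voted answer
    (True = correct).  Returns ({vertex: color}, number of questions).
    """
    colors, asked = {}, 0
    lo, hi = 0, len(path) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        asked += 1
        if oracle(path[mid]):
            for v in path[mid:]:
                colors[v] = "Green"   # descendants inferred Green
            hi = mid - 1
        else:
            for v in path[:mid + 1]:
                colors[v] = "Red"     # ancestors inferred Red
            lo = mid + 1
    return colors, asked

truth = {"a": False, "b": False, "c": True, "d": True}
colors, asked = color_path(["a", "b", "c", "d"], lambda v: truth[v])
print(colors, asked)
# {'a': 'Red', 'b': 'Red', 'c': 'Green', 'd': 'Green'} 2
```

All four vertices are colored with only two questions; in general a path of length n needs O(log n) questions rather than n.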

The serial path-based vertex-selection algorithm can only publish a single fact to the crowdsourcing platform at a time; it cannot crowdsource candidate facts simultaneously, which results in long latency. To overcome this drawback, we extend the path-based algorithm to a parallel setting, selecting multiple vertices and publishing the corresponding candidate facts to the crowdsourcing platform simultaneously in each iteration. The pseudocode is shown in Algorithm

However, the parallel algorithm may generate conflicts. For example, if

Note that the maximal matching can be computed in

We design a topological-sorting-based algorithm to improve the time efficiency of the maximal matching computation. It first computes topologically sorted sets
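Such topologically sorted sets can be computed level by level with Kahn's algorithm, as sketched below. Within one level no vertex is an ancestor or descendant of another, so (under this assumed representation) the vertices chosen from one set can be crowdsourced in parallel without their answers constraining each other along a rule path.

```python
def topo_levels(vertices, edges):
    """Partition a DAG of ground rules into topologically sorted sets
    L0, L1, ...: every edge goes from an earlier set to a strictly later
    one.  Uses Kahn's algorithm, peeling off zero-in-degree vertices.
    """
    indeg = {v: 0 for v in vertices}
    succ = {v: [] for v in vertices}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    level = [v for v in vertices if indeg[v] == 0]
    levels = []
    while level:
        levels.append(sorted(level))  # sorted only for stable output
        nxt = []
        for u in level:
            for v in succ[u]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    nxt.append(v)
        level = nxt
    return levels

print(topo_levels(["a", "b", "c", "d"],
                  [("a", "c"), ("b", "c"), ("c", "d")]))
# [['a', 'b'], ['c'], ['d']]
```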

There are two types of possible errors in our graph-based framework. The first type is caused by workers’ errors and the second type is propagated through inference rules. For example, suppose a candidate fact
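One standard way to tolerate the first type of error is to maintain a probability that each voted answer is correct rather than trusting the vote outright. The Bayesian update below is an illustrative sketch, not the paper's exact algorithm; it assumes a common, known worker accuracy and independent votes, and a color would only be committed (and propagated through the graph) once the posterior is sufficiently confident.

```python
def posterior_correct(prior, yes_votes, no_votes, worker_acc=0.9):
    """Bayesian update of the probability that a fact is correct, given
    independent worker votes that are right with probability worker_acc.

    prior: probability the fact is correct before asking the crowd
           (e.g., the extractor's confidence score).
    Returns P(fact correct | observed votes).  Committing a color only
    when this posterior passes a threshold keeps a few wrong votes from
    poisoning the inference that follows.
    """
    p_yes_if_true = worker_acc
    p_yes_if_false = 1.0 - worker_acc
    like_true = (p_yes_if_true ** yes_votes) * ((1 - p_yes_if_true) ** no_votes)
    like_false = (p_yes_if_false ** yes_votes) * ((1 - p_yes_if_false) ** no_votes)
    num = prior * like_true
    return num / (num + (1.0 - prior) * like_false)

# Four Yes votes against one No leave little doubt the fact is correct.
p = posterior_correct(prior=0.5, yes_votes=4, no_votes=1)
print(round(p, 3))  # ~0.999
```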

The pseudocode of our error-tolerant coloring algorithm is shown in Algorithm

In this section, we evaluate our methods and report experimental results.

Statistics of the dataset.

| Dataset | Category | Relation | Total |
|---|---|---|---|
| Candidate | 836K | 182K | 1.02M |
| Promotion | 354K | 64K | 418K |
| Ontological Relation | 18K | 52K | 70K |
| Test | 2002 | 2546 | 4546 |
| Training | 4777 | 5089 | 9866 |

We calculate contradictoriness scores among all candidate facts and select candidate facts from the test set for crowdsourcing. We use a threshold of 0.5 for the confidence score. For crowdsourced data, a fact is treated as correct only when more than half of the crowd answers are “Yes.” We compare our methods with other popular methods in terms of quality, the number of questions, and the number of iterations. To evaluate quality, we use three metrics, namely, precision, recall, and

We first compare our method with state-of-the-art methods for knowledge refining. Then we evaluate our rank function, question selection strategies, and error-tolerant techniques, respectively.

To evaluate the effectiveness of our proposed techniques, we compare our methods Rank, Graph, and Graph+ (the graph-based method with error-tolerant techniques) with two recent methods for cleaning automatically extracted knowledge bases, namely, MLN [

Given a budget

Evaluation of proposed methods. (a) Quality comparison of proposed methods for different crowdsourcing budgets. (b) Quality comparison with the state-of-the-art methods.

In this experiment, we evaluate our ranking function, which is key to selecting candidate facts for crowdsourcing in the rank-based method. This function, denoted as U

Figure

Evaluation of different ranking functions. (a) Precision. (b) Recall.

From Section

Evaluation of question selection strategies on the test dataset. (a) Quality. (b) Number of questions. (c) Number of iterations.

From Figure

To evaluate our graph-based method (Graph) on reducing the number of questions, we conduct additional simulation experiments on the complete dataset, using NELL beliefs as ground truth and simulating workers with an accuracy of 90%. Our experimental results are shown in Figure

Evaluation of question selection strategies on the complete dataset. (a) Quality. (b) Number of questions. (c) Number of iterations.

In this section, we evaluate the effectiveness of our error-tolerant solution (proposed in Section

Evaluation of our error-tolerant technique. (a) Quality. (b) Number of questions. (c) Number of iterations.

From Figure

We proposed a cost-effective method for cleaning automatically extracted knowledge bases using crowdsourcing. Our method uses a ranking score to select the most beneficial candidate facts for crowdsourcing in terms of improving the quality of knowledge bases. We constructed a graph based on the semantic constraints and utilized the graph to crowdsource questions and infer answers. We evaluated the effectiveness of our methods on real-world web extractions from NELL. Our experimental results showed that our method outperforms both MLN-based and PSL-based methods in terms of

The authors declare that there are no conflicts of interest regarding the publication of this paper.

This work was partially supported by Chinese NSFC (61170020, 61402311, and 61440053), Jiangsu Province Colleges and Universities Natural Science Research project (13KJB520021), Jiangsu Province Postgraduate Cultivation and Innovation project (CXZZ13_0813), and the US National Science Foundation (IIS-1115417).