Ranking Biomedical Annotations with Annotator's Semantic Relevancy

Biomedical annotation is a common and effective artifact for researchers to discuss, express opinions, and share discoveries. It is becoming increasingly popular in many online research communities and carries much useful information. Ranking biomedical annotations is a critical problem for data users who need to get information efficiently. As the annotator's knowledge about the annotated entity normally determines the quality of the annotations, we evaluate that knowledge, that is, the semantic relationship between them, in two ways. The first is extracting relational information from credible websites by mining association rules between an annotator and a biomedical entity. The second is frequent pattern mining over historical annotations, which reveals common features of the biomedical entities that an annotator can annotate with high quality. We propose a weighted and concept-extended RDF model to represent an annotator, a biomedical entity, and their background attributes, and we merge the information from the two ways as the context of an annotator. Based on that, we present a method to rank the annotations by evaluating their correctness according to users' votes and the semantic relevancy between the annotator and the annotated entity. The experimental results show that the approach is applicable and efficient even when the data set is large.


Introduction
Annotations are allowed in most online biomedical databases, such as NCBI (http://www.ncbi.nlm.nih.gov/), UCSC Genome Browser (http://genome.ucsc.edu/), GDB (http://www.gdb.org/), DDBJ (http://www.ddbj.nig.ac.jp/), and so forth. Shared annotations are becoming increasingly popular in online communities. Annotating is a fundamental activity and plays an important role in the research community: with it, researchers can explain and discuss experimental data and share their discoveries [1][2][3][4][5]. Like shared comments on documents, pictures, and videos, annotations are also an important data source for biomedical researchers, because they imply additional facts and the annotator's opinions about the biomedical entity. As an example, researchers discovered information about a new protein family with annotations in Flybase [6] and UniProtKB/Swiss-Prot [7]. More and more researchers now recognize that it is important to attach and analyse annotations on biomedical entities.
In an open community, many annotations may be attached to a single biomedical entity. This raises the question of how to rank the annotations so that users can get the most useful information in the least time.
Ranking annotations is important and useful for an online biomedical community. Biomedical research is active, and knowledge about a biomedical entity can be renewed every day. Many of the new discoveries appear in the form of annotations. To follow the latest thinking and discoveries, researchers spend much time viewing these annotations. A ranking module can help them retrieve high-quality annotations quickly and improve the efficiency of discussions. Rankings also encourage users to publish correct and validated opinions and materials about the biomedical data, so that the community becomes more active and a more important data center and discussion platform.
Ranking reviews, which can be viewed as a type of annotation, is a common problem on many e-commerce and news websites [8,9]. Popular previous methods are mostly based on voting or scoring. Unfortunately, voting and scoring cannot prevent wrong opinions from spreading, because users tend to agree with the most popular reviews even if they do not know whether those reviews are right. As a result, useless, even spiteful, reviews constantly appear in the top positions on many websites.
The quality of a scientific annotation depends in part on how much the annotator knows about the biomedical entity. The more knowledge the annotator has, the more correct his annotations can be and, thus, the more useful they are to the data user. For example, for the H1N9 virus, annotations from an astrophysicist are normally less correct than those submitted by a biologist who concentrates on bird flu. A user's knowledge is indicated by his semantic background, such as working experience, study, and research. If a user is viewed as an object, the semantic background is the composite of all attributes that describe the user or his related objects; a biomedical entity can be described in the same way. We say that a biomedical entity and a user are semantically related if their semantic backgrounds partly match. The more they match, the more the user may know about the entity.
In the scientific community, an annotator's knowledge is reflected in the papers he published and the approaches he focused on, which can be obtained from the Internet or other public data sources. With such background data, how much the annotator may know about the entity he annotated can be deduced. Besides, accepted historical annotations also reflect the annotator's knowledge about the annotated entity. If an annotator always contributes high-quality annotations to entities with the same attributes, we can say that he is familiar with other such entities. In this paper, we propose a weighted and concept-extended resource description framework (RDF) [10] to represent an annotator and a biomedical entity. For any given pair of annotator and biomedical entity, an RDF graph is created, where the annotator is the root node, the attributes of the entity and its one-step extended concepts are the leaf nodes, and each edge is assigned a weight denoting how much the root node knows about the target node. The weight is evaluated by their cooccurrence in credible web data. In addition, frequent patterns of the biomedical entities that were historically annotated by a given annotator are mined. We assume that there are no malicious users and that people only annotate biomedical entities that they know. Both the weight and the matching degree of the annotated entity to the frequent patterns are interpreted as the semantic relevancy. Accordingly, we present a method to rank the annotations by evaluating their correctness with the semantic relevancy between the annotator and the biomedical entity.
Organization. Section 2 reviews related work. Section 3 introduces the weighted RDF graph model and related concepts. Section 4 presents the two main contributions of this paper. One is how to initialize the RDF graph of an annotator and a biomedical entity by web information extraction, including details of computing weights for an annotator's RDF graph by association mining over open credible web information. The other is the algorithm for mining frequent items of historically annotated biomedical entities. Section 5 gives the formulas for evaluating the correctness of a new annotation. Section 6 reports experimental results. The last section concludes the paper.

Related Work
Evaluating and ranking biomedical annotations are new problems. The most closely related research topics are ranking reviews, estimating the quality of web content, and opinion strength analysis.
Ranking reviews or other web content has always been a complex problem and attracts renewed research interest in many fields, especially as the web plays an increasingly important role in delivering and obtaining information for many people. Most previous methods are based on user reputation, word-of-mouth, webpage links, and other types of user voting [8,9][11][12][13][14]. Ai and Meng proposed a method based on weighted fan-in page links and copies to recommend recruitment advertising [11]. Its viewpoint is that the more the users believe an advertisement and the more dependable the websites are, the higher the quality of the advertisement will be. Largillier et al. present a voting system for news articles using a statistical filter and a collusion detection mechanism in [8]. It is reasonable to rank web content according to the author's reputation and users' votes in some applications. However, the former is unworkable when the user does not have enough historical annotations, while the latter cannot exclude the propagation of rumors. In this paper, we evaluate an annotation's quality from the new perspective of the semantic relation between annotators and the annotated biomedical entities, which is, to the best of our knowledge, scarcely considered by previous approaches. In the biomedical domain, the correctness of a user's annotations largely depends on the annotator's knowledge about the annotated entities, so the semantic relevancy between them plays a critical role in quality evaluation. We therefore expect a ranking that uses semantic relevancy to be more convincing than one based on reputation or votes alone.
Some prior works also try to discover the inherent relationship between data and its users by data mining techniques [15][16][17][18][19][20][21][22]. They can be classified into three categories: statistical methods based on the cooccurrence of terms [16], machine learning techniques [17], and hybrid approaches combining the two [18]. Staddon and Chow studied online book reviews of http://www.amazon.com/ and proposed a method of quality evaluation by mining the association rules between book authors and book reviewers [15]. In [22] the authors proposed three models to evaluate the quality of Wikipedia articles by measuring the influence of the author's authority, review behavior, and the edit history on the quality of the article. These studies also try to discover the semantic relationship between data and its users, but they did not consider the textual content of reviews or other online opinions [18][19][20], and their criteria are simple; for example, association relationships are defined in [15] as the cooccurrence of the author's name and the annotator's name on the web. As a result, they cannot reveal comprehensive semantic relevancy. We describe the entities by their entire semantic context, with their attributes and related biomedical entities, and based on that we can analyze multidimensional semantic relationships between biomedical entities and their annotators. In addition, we parse the textual content of the annotation and highlight the attributes mentioned in it when matching patterns and evaluating its correctness. Other related works are biomedical web information extraction, biomedical text mining, and biomedical entity recognition [23][24][25][26][27][28]. They are related but independent problems. We did not propose new algorithms or develop new tools for those problems; instead, we applied existing methods and applications.
Performance trials are reported on the website of the BioCreative group (http://www.biocreative.org/) [23], an ontology-driven term extraction service for biomedical text is provided by the National Center for Biomedical Ontology (NCBO), and biomedical text mining applications have been developed by several academic groups and other organizations [24][25][26][27][28][29].

Weighted RDF Graph and Concepts
RDF is a graph-based framework for representing concepts on the web by linking its concrete syntax to its formal semantics. In RDF, any expression is a triple of a subject, a predicate, and an object, which can be illustrated by a node-arc-node linked graph as shown in Figure 1. A node represents a subject or an object, and a directed arc with a predicate represents the relationship between them.
A biomedical entity can be viewed as an RDF subject; its attributes and concept field can be regarded as its objects. Figure 2 shows the RDF graph of protein structure 1J1I in RCSB, whose main features include molecule, protein sequence, function, and authors. Attribute nodes can be extracted from the online biomedical databases and their linked credible websites. Here, we say that a node is an attribute node if its outdegree is 0, and the others are entity nodes. The tag of an entity node is composed of the type name and the ID of the entity in the form typeName:entityID. As many attribute nodes as possible are extracted so that an entity can be specified more exactly.
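The node classification above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `RdfGraph`, the predicate names, and all node values are hypothetical placeholders, not data taken from RCSB.

```python
from collections import defaultdict


class RdfGraph:
    """Minimal directed graph of (subject, predicate, object) triples."""

    def __init__(self):
        self.edges = defaultdict(list)  # subject -> [(predicate, object)]
        self.nodes = set()

    def add(self, subject, predicate, obj):
        self.edges[subject].append((predicate, obj))
        self.nodes.update((subject, obj))

    def attribute_nodes(self):
        # A node whose outdegree is 0 is an attribute node.
        return {n for n in self.nodes if not self.edges.get(n)}

    def entity_nodes(self):
        # Every other node is an entity node.
        return self.nodes - self.attribute_nodes()


# Entity tags follow the typeName:entityID form; values are placeholders.
g = RdfGraph()
g.add("protein:1J1I", "molecule", "molecule:M1")
g.add("molecule:M1", "name", "placeholder molecule name")
g.add("protein:1J1I", "function", "placeholder function")

print(sorted(g.entity_nodes()))
print(sorted(g.attribute_nodes()))
```

Storing outgoing edges per subject keeps the outdegree test a single lookup, which is all the attribute/entity distinction needs.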
An annotator can also be viewed as an RDF subject, and the biomedical entities he/she annotated can be its objects. Annotators may have many attributes, but we only consider those locally described and those related to the biomedical entity. We use two types of RDF graph to specify an annotator. One is named the annotator's RDF graph, whose composition is detailed in Section 4.1, and the other is a set of frequent patterns of his/her historically annotated entities. In the RDF graph of annotator u, the annotator is the root node, the biomedical entity and its related concepts are the annotator's object nodes, and the weight on the edge pointing to node n, which is marked as w, is initialized as the correlation degree of u and n. Instead of a weight, frequency and correctness are attached to each pattern, indicating their semantic relevancy.
Different from other data, scientific data has a complicated concept background. It can be a node in a complex relation network. There is a high possibility that people who know concept c will also know about c's subconcepts, c's father concept, or c's related concepts. For example, an annotator who knows much about Trichophyton tonsurans and Trichophyton schoenleinii may also know about Trichophyton rubrum, because they all cause a type of mycosis, namely similar forms of tinea. An intent weight is calculated for such a possibility.
Definition 1 (intent weight). Suppose annotator u knows concept c1 with a weight of w, c is a father concept or a related concept of c1, and there are n − 1 other concepts c2, c3, . . . , cn that are also subconcepts or related concepts of c, but u does not directly know about them; then the weight on the edge pointing to ci (2 ≤ i ≤ n) in the annotator graph of u is w/n. Such a weight is called the intent weight of ci against c1, marked as iw(c1, ci).
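Definition 1 can be read as sharing the weight of a directly known concept among its sibling concepts. The sketch below assumes that reading: the known weight is divided by the total number of sibling concepts, including the known one. The function name `intent_weights` and the weight 0.9 are hypothetical, chosen only for illustration.

```python
def intent_weights(known_weight, other_concepts):
    """Share the weight of a directly known concept among its siblings.

    other_concepts lists the sibling concepts the annotator does not know
    directly; n counts them plus the one known concept.
    """
    n = len(other_concepts) + 1
    return {concept: known_weight / n for concept in other_concepts}


# Hypothetical case: the annotator knows Trichophyton tonsurans with weight
# 0.9, and two sibling concepts share the same father concept.
weights = intent_weights(0.9, ["Trichophyton rubrum", "Trichophyton schoenleinii"])
for concept, weight in weights.items():
    print(concept, round(weight, 3))
```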
Total intent weight of a concept in u's RDF graph is defined as follows: Here, is father or related concept of , is number of concepts whose relationship with is identical to that of with , and the relationships are defined in open biomedical databases such as FACTA+ and Go Terms.
Definition 2 (RDF path). (1) If there is an edge between an entity node and an attribute node , we say that / / is a RDF path between and . (2) If there is a RDF path between entity node and and an edge between entity node and , we say that / is a RDF path between and . The first node is root node of a RDF path. And pattern path is a RDF path without entity node value.

Definition 3 (prefix path). Given an RDF path or a pattern path p, the subsequence from the root node to the edge pointing to a nonroot node n is a prefix path of n in p.
Two RDF paths with identical prefix path are conjugate. Conjugate RDF paths can be merged into a sub-RDF graph and conjugate sub-RDF graphs can be merged into a bigger sub-RDF graph when merging the identical ancestor nodes.
Given two RDF paths and , if there is a RDF path in , where = , we say that ⊂ . Similarly, Given two sub-RDF graphs 1 and 2, if, for all ⊂ 1 ( is a RDF path), ⊂ 2, we say that 1 ⊂ 2, and if 1 ⊂ 2 and 2 ⊂ 1, we say that 1 = 2.
Likewise, two pattern paths with same prefix path are conjugate. Two conjugate pattern paths can be merged into a subpattern RDF graph. And a pattern path can belong to a pattern RDF graph , if it is equal to a path in the graph. And for any two pattern RDF graphs 1 and 2, if, for all ⊂ 1 ( is a pattern path), ⊂ 2, we say that 1 ⊂ 2, and if 1 ⊂ 2 and 2 ⊂ 1, we say that 1 = 2.
Additionally, let us define some symbols used as follows.
(i) pp cr, | = is a frequent pattern path of user from biomedical entity to with correctness cr and frequency and is value of attribute . Similarly, pp | = is a path of user pointing to with weight and is value of attribute . (ii) cr is a frequent pattern of user on attribute with correctness of cr, which is composed of frequent pattern paths.

Building Annotator's RDF Graph
In the following, Section 4.1 details how the annotator's RDF graph is composed and how the weights are computed by association mining over open credible web information. Section 4.2 presents the frequent pattern mining algorithm.

Initializing Annotator's RDF Graph with Web Information.
A great deal of information can be extracted from the Internet, but only the information about the biomedical entity and the annotation is useful in this application. Given an annotation ⟨u, o, a⟩, where u is the annotator in the form of an RDF node or an RDF graph, o is the RDF graph of the biomedical entity, and a is the annotation, the complete RDF graph of u comprises the following: (i) u, (ii) o, (iii) an edge from the root node of u pointing to o.
Here (1) u is initialized as an entity node when no local information can be used, or as an RDF graph generated from the annotator's background data in the online database itself; (2) o is initialized as stated in the following.
Generating RDF Graph for a Biomedical Entity. The RDF graph of a biomedical entity is initialized according to what is described in the online database. In our experiments, we created o by the following steps.
(1) Recognize the id (e.g., DOI) and type (protein, virus, etc.) of the biomedical entity with predefined keywords or normal structure, and compose its entity node with a tag of the form "Type:id."
(2) Extract each head item as an edge from predefined modules such as "molecular description" and "experimental detail," extract the value of the item as its attribute node, or compose another level of entity nodes if the module contains several items, and draw edges from the entity node to the attribute nodes.
(3) Extract the family classification according to the databases linked on the page, such as Go Terms, look one step further into the detail of each linked database, recognize relationships between entities (e.g., mapping a protein to an organism or finding proteins of the same family), draw RDF graphs for them, merge the RDF graphs of the different linked databases, and eliminate duplicate RDF paths.
Figure 3 shows a segment of the information we extract from the online database; the circled items are extracted as edges and their values are extracted as attribute nodes. Figure 4 shows an example of the one-step extension of the biomedical entity's related concepts to FACTA+.
Annotation Analysis. Bioconcepts in the annotations can be extracted by biomedical text analysis tools such as GENIA [29]. These concepts are normally the annotation's topic. We extract the bioconcepts and their attribute names in an annotation; here the attribute names can be recognized by the patterns "XX of bioconcept" or "bioconcept's XX." For each concept, we draw an entity node and an edge for each of its attribute names, even without an attribute value. The RDF graphs of the annotation are then merged into that of the biomedical entity o, and the merged parts are marked. If they cannot be merged, an edge without weight is drawn from the annotator to their root nodes.
Weight Calculating. The weight on an edge is assigned as the cooccurrence of the annotator and the edge's target node in credible open data sources, such as news, talks, papers, and personal pages published by predefined credible organizations, known proceedings, and websites. In the experiment, we use Google to search the news, talks, and personal pages, and Anne O'Tate [30] and PIE [31] to search papers on PubMed and MEDLINE. At present, we do not consider the situation of different concepts referring to the same biomedical entity, which is a separate scientific problem known as biomedical text mining and clustering.
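The two surface patterns named above, "XX of bioconcept" and "bioconcept's XX", can be matched with regular expressions once the bioconcepts are known. The following is a minimal sketch, assuming the concept list has already been produced by a tagger such as GENIA; the function name `attribute_names` and the sample sentence are hypothetical.

```python
import re


def attribute_names(text, concepts):
    """Find attribute names attached to known bioconcepts in free text."""
    found = {}
    for concept in concepts:
        escaped = re.escape(concept)
        # Pattern "XX of bioconcept" (an optional "the" is tolerated).
        of_form = re.compile(r"\b(\w+) of (?:the )?" + escaped + r"\b", re.IGNORECASE)
        # Pattern "bioconcept's XX".
        possessive = re.compile(r"\b" + escaped + r"'s (\w+)", re.IGNORECASE)
        names = [m.group(1).lower() for m in of_form.finditer(text)]
        names += [m.group(1).lower() for m in possessive.finditer(text)]
        found[concept] = sorted(set(names))
    return found


sample = "The sequence of 1J1I is conserved, and 1J1I's function is unclear."
print(attribute_names(sample, ["1J1I"]))
```

Single-word attribute names are a simplification; multiword attributes would need a chunker rather than `\w+`.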
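To make the idea of a cooccurrence-based weight concrete, the sketch below counts pages in a small in-memory corpus. The normalisation used here (pages mentioning the annotator, the edge term, and the target term, divided by pages mentioning the annotator) is an assumption made for illustration and is not claimed to be the paper's formula; the function names and the sample pages are hypothetical.

```python
def page_count(pages, *terms):
    """Number of pages mentioning every term (case-insensitive substring match)."""
    lowered = [term.lower() for term in terms]
    return sum(1 for page in pages if all(t in page.lower() for t in lowered))


def cooccurrence_weight(pages, annotator, edge, target):
    """Assumed normalisation: joint page count over the annotator's page count."""
    base = page_count(pages, annotator)
    if base == 0:
        return 0.0  # the annotator never appears, so no evidence of knowledge
    return page_count(pages, annotator, edge, target) / base


# Hypothetical corpus standing in for pages from credible sources.
pages = [
    "A. Smith studies the function of protein X",
    "A. Smith gives a talk on astronomy",
    "Protein X function review",
]
print(cooccurrence_weight(pages, "A. Smith", "function", "protein X"))
```

In practice the counts would come from a search service restricted to credible sites rather than from substring matching.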
Suppose term of the annotator is 1, term of the node is 3, and term of the edge pointing to is 2; then weight on the edge from web is defined as follows: is an attribute node ∑ 1 is not an attribute node.
Computational and Mathematical Methods in Medicine  Here ( 1 ∧ 2) is the count of web pages that include 1 and 2, and is an object node that points to. Considering the fact indicated by intent weight 1 , weight on the edge from web is finally defined as follows:

Mining the Frequent Entity Patterns.
An annotator's knowledge about a biomedical entity can also be inferred from his historical annotations. In this section, we present an algorithm to discover frequent features of the historically annotated entities with correctness larger than 0.6. The algorithm considers not only the direct attributes of the entity but also those of its one-step extended related concepts. As illustrated in Figure 5, the algorithm first classifies all annotations according to their annotator and then clusters each subset of annotations by correctness with k-means. The correctness of each annotation in a cluster is taken to be that of the cluster center. Lastly, frequent patterns are mined over the biomedical entities in each cluster. Several questions arise here. First, because of the classification and clustering, the input data set can be too small to produce any patterns. The algorithm uses Laplacian smoothing to solve this. Second, the algorithm can produce too many frequent patterns, some of which are included in or similar to another one. The algorithm uses Rule 1 to merge those that describe the same owner and the same attribute but with different attribute values and Rule 2 to merge identical patterns with different correctness. Third, the data sets can be improperly clustered so that frequent patterns cannot be found. The algorithm runs new rounds of clustering and frequent pattern mining until the mining results no longer change.
Frequent sub-RDF graph mining is the key step of the whole algorithm (step 2.3 of Algorithm 1). It takes the pattern paths of the entities as the items. Both the initial set and the result set are initialized as the set of frequent items obtained by the first-round scan, and the result set is repeatedly refreshed by replacing each element with its one-item extension if the extension is also frequent. As shown in Figure 6, in the first round of extension, each element in the result set is conjoined with each element in the initial set; for example, the conjunction of 1 and 2 is also frequent, so 1 and 2 will be replaced by 1
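The clustering step can be illustrated with a plain one-dimensional k-means over correctness scores, after which every annotation takes the correctness of its cluster center. This is a minimal sketch, not Algorithm 1 itself: the function name, the deterministic initialisation, and the sample scores are assumptions.

```python
def cluster_correctness(values, k, max_rounds=100):
    """Plain 1-D k-means; returns the centers and (value, center) pairs."""
    if not 0 < k <= len(values):
        raise ValueError("k must be between 1 and the number of values")
    ordered = sorted(values)
    # Deterministic start: pick k values spread evenly over the sorted input.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(max_rounds):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        updated = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated

    def center_of(v):
        return min(centers, key=lambda c: abs(v - c))

    return centers, [(v, center_of(v)) for v in values]


# Hypothetical correctness scores of one annotator's annotations.
scores = [0.61, 0.63, 0.65, 0.90, 0.92, 0.94]
centers, assigned = cluster_correctness(scores, k=2)
print(centers)
```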
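The level-wise extension described in this paragraph can be sketched as follows, treating each entity as a set of pattern paths. It is an illustration under assumptions, not a reproduction of Algorithm 1: the function name, the support threshold, and the sample pattern paths are hypothetical, and the smoothing and merging rules are left out.

```python
def mine_frequent_patterns(entities, min_support):
    """Grow frequent item sets by one-item extensions until nothing changes."""
    total = len(entities)
    if total == 0:
        return set()

    def support(itemset):
        return sum(1 for entity in entities if itemset <= entity) / total

    items = sorted({item for entity in entities for item in entity})
    # First-round scan: the frequent single items form the initial set.
    initial = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    result = set(initial)
    changed = True
    while changed:
        changed = False
        refreshed = set()
        for pattern in result:
            extensions = {pattern | single for single in initial if not single <= pattern}
            frequent = {ext for ext in extensions if support(ext) >= min_support}
            if frequent:
                refreshed |= frequent  # replace the pattern by its frequent extensions
                changed = True
            else:
                refreshed.add(pattern)  # nothing larger is frequent; keep it
        result = refreshed
    return result


# Hypothetical pattern paths of four annotated entities.
entities = [
    {"protein/function", "protein/organism", "protein/family"},
    {"protein/function", "protein/organism"},
    {"protein/function", "protein/organism", "protein/ligand"},
    {"protein/family"},
]
for pattern in sorted(mine_frequent_patterns(entities, 0.5), key=sorted):
    print(sorted(pattern))
```

Because a pattern is only ever replaced by strictly larger frequent sets, the loop terminates once no element can be extended.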

Ranking Annotation
In this section, we propose an algorithm to evaluate the correctness (quality) of an annotation a(u, o) of biomedical entity o from user u under different situations: (1) u is directly semantically related to o; (2) o is an entity node in the RDF graph of u or matches at least one frequent entity pattern of u on an attribute; (3) u has annotated another biomedical entity which is similar to o; (4) o has been annotated by other users who are similar to u; (5) u has never annotated any entity and o has never been annotated. The annotator is semantically related to the annotated biomedical entity in the first two situations, and fully (100%) semantically relevant in the first one. We give formulas to evaluate the correctness of annotations for these two situations in Section 5.1. The problem of computing correctness in the last three situations is called a "new user" problem, which is solved by borrowing the credibility of the nearest neighbor; details are stated in Section 5.2. Overall, the annotations on a biomedical entity are ranked in decreasing order of their evaluated correctness. Besides the semantic relationship, we also consider users' votes and historical annotations on similar biomedical entities from similar annotators when computing the credibility of annotations. A user's vote is a direct measure of the degree of agreement. For the new user problem, where no semantic relationship exists, similar historical annotations can be borrowed to estimate the annotation's correctness.

Evaluating When Semantically Related.
When annotator u is an attribute node in the RDF graph of the biomedical entity o or o is an attribute node of u, we say that they are semantically related to each other. More strictly, for an annotation a(u, o), suppose G1 is the RDF graph of annotation a, G2 is the RDF graph of annotator u, G3 is the RDF graph of biomedical entity o, and Ω is a set of frequent patterns of u, whose forming methods are all stated in Section 4; if there exist a prefix path pr1 ∈ G3 and a prefix path pr ∈ G1 such that pr1 = pr and one of G3's entity nodes is u, we say that u is directly semantically related to o. Normally, if (1) there is a prefix path pr ∈ G1, where pr ∈ G2, or (2) there is at least one path in G3 matching a frequent pattern in Ω, we say that u is semantically related to o.
Given an annotation ( , ), if user is directly semantically related to biomedical entity and supposing that is a set of voting scores on , where only the max one of each user's votes will be kept, then correctness acr of is Here, | | is the number of elements in set . Furthermore, suppose 1, 2, 3 are the RDF graphs of , , and , respectively, and Ω is a set of frequent patterns of , if is non-directly but semantically related to , correctness of is decided by the weight of 1 in 3 and the max matching degree of to a frequent pattern in Ω. Supposing that cr is a frequent pattern of with correctness cr and supposing that cr has RDF pattern paths, among which pattern paths (suppose 1 cr1, 1 , . . . , cr , ) match both an RDF path of and a prefix path of 1, then the feature matching degree of and cr is defined as follows: cr , is correctness of pattern path pp .
And supposing that there are paths of 1 belonging to 2 with weight 1, . . . , on each edge pointing to the attribute nodes, then correctness acr of is defined as follows: ( ∈ Ω and match a prefix path of 1) .

Evaluating for "New User".
When there is neither an annotator's RDF graph nor frequent patterns indicating that the annotator and the entity are semantically related, but u has annotated other biomedical entities or o has been annotated by other users, we can use the nearest neighbor to evaluate the correctness of annotation a(u, o).

For a given biomedical entity , its nearest neighbor is a set of biomedical entity in which each element satisfies the next condition: Here, | 2 | is number of RDF paths that belong to both o and , | | is number of paths that belong to , | | is number of paths that belong to , and is threshold defined by user.
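The quantities named here (the number of shared RDF paths and the number of paths of each entity) are enough to build a set-overlap test. The sketch below assumes a Jaccard-style ratio compared against the user-defined threshold; that choice of ratio, the function name, and the sample paths are assumptions for illustration only.

```python
def nearest_neighbors(target_paths, candidates, threshold):
    """Return the candidate entities whose path overlap reaches the threshold."""
    neighbors = []
    for name, paths in candidates.items():
        shared = len(target_paths & paths)               # paths in both entities
        union = len(target_paths) + len(paths) - shared  # paths in either entity
        if union and shared / union >= threshold:
            neighbors.append(name)
    return sorted(neighbors)


# Hypothetical RDF paths written as slash-separated strings.
target = {"protein/function/hydrolase", "protein/organism/X", "protein/family/F1"}
candidates = {
    "protein:E1": {"protein/function/hydrolase", "protein/family/F1", "protein/organism/Y"},
    "protein:E2": {"protein/organism/Z"},
}
print(nearest_neighbors(target, candidates, threshold=0.5))
```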
Similarly, nearest neighbor of a given user is also a set of users among which each user satisfies the following conditions: Here, |appear( )| is number of unique appearance of in papers, public talks, news, and so forth, especially papers in PubMED and MEDLine, while |appear( , )| is the coappearance of and in the above data sources. | cr> | is number of biomedical entities that was annotated by both and with correctness larger than user defined threshold , | cr> | is number of biomedical entities that was annotated by with correctness larger than user defined threshold , | cr> | is number of biomedical entities that was annotated by with correctness larger than user defined threshold , and is threshold defined by user. Now, given an annotation ( , ), if user is not semantically related to biomedical entity , supposing that is a set of unique user's voting score on , supposing that is a set of users who are the nearest neighbor of , and is a set of biomedical entity who are the nearest neighbor of , then correctness acr of is is not empty is empty and is not empty.
Here, | | is also the number of the elements in set . acr is correctness of annotation submitted by user on biomedical entity . Lastly, given an annotation ( , ), if user never submits any annotation and biomedical entity has never been annotated and supposing that is a set of voting score on , where only the max one of each user's voting will be kept, then its correctness acr is defined as

Experimental Evaluation
This paper makes three contributions: (1) extracting web information to compute the relevancy of an annotator and a biomedical entity, (2) frequent pattern mining of the historical annotations, and (3) evaluating the correctness of the annotations. In this section, we state how we use existing tools to extract web information and obtain our experimental data, and we show the performance of the frequent pattern mining and ranking evaluations.  Table 1 and the others are randomly generated: random annotator, random biomedical entity, and random annotation with random correctness. As shown in Table 1, 1000 of the annotators are classified into 9 types. Each type is designed to contribute a certain number of annotations with correctness in a certain range. To test the cold-start problem, several users are designed to contribute 5 or fewer annotations. On the other hand, to ensure that patterns can be found, at least five users of each type give annotations on 5 to 15 biomedical entities with common features.
As for the web information, we presearched the 20000 pairs of users and biomedical entities and stored their weights in a database. First, each biomedical entity is extended by one step in FACTA+ to get its related concepts. Then, to evaluate the weight, we get information in two ways: searching Google for news, talks, and homepages and searching PIE the search [31] for papers and other documents. To search Google, we wrote a C# program that automatically searches the predefined credible websites with the Google service, using keywords that include the name and affiliation of the annotator and the scientific name of the biomedical entity, with the extended concept or attribute name of the biomedical entity as an additional keyword. On the other hand, we apply and evaluate PIE the search to count the documents that indicate their semantic relationship. The resulting corpus contains a set of medical articles in XML format. From each article we construct a text file by extracting relevant fields such as the title, the summary, and the body (if they are available).

Frequent Pattern Mining.
We test 8 groups of data ( 1 ∼ 8 in Table 2), each of which includes only annotations published by one annotator and belonging to one correctness group. The max group ( 7) has 700 annotations and about 36 biomedical entities but on different attribute sets, while the min group ( 3) has 100 annotations and about 10 biomedical entities. Biomedical entities in each group have some common attributes, which can be recognized as frequent pattern paths (fre. Attr. column in the table) after the first round of computing in the algorithm. Some of the frequent pattern paths appear in every biomedical entity; we say that they are 100% fre. Attr. The association of such items is certainly frequent; thus, we put their association directly into the final mining result set and skip another round of computing. The experimental results (Figure 7) show that the main time consumer is recursively computing the associated frequent pattern paths. 3 takes the most time, because the 18 frequent (frequency below 100%) items need 15 rounds of computing to judge whether any level of their associations is also frequent. 4 is carried out at minimal cost, because no frequent pattern path can be found and only the first round of computing happens.
Ranking.
The experiments are executed over 5 sets of data. Different data sets contain different scales of annotations and frequent patterns. As shown in Table 3, 1 is the minimal data set, where 5000 annotations submitted by 100 annotators on 50 biomedical entities are evaluated and ranked with 49 frequent patterns, while 5 is the maximal one, including 40,000 annotations from 200 annotators on 200 biomedical entities, which are evaluated and ranked with 400 frequent patterns. Because the weights on the edges between each user and biomedical entity are precomputed and stored in the database, the most time-consuming step is pattern matching. As shown in Figure 8, the time goes up as the number of patterns or annotations goes up. But even for 5, 5 minutes is enough to rank 40,000 annotations, which shows the efficiency and applicability of the algorithm.

Conclusion
In this paper, we propose an approach for ranking biomedical annotations according to users' votes and the semantic relevancy between an annotator and the biomedical entity he annotated. Our idea is inspired by the fact that, in a credible online scientific community, the quality of web content is determined to some extent by the contributor's knowledge about the entity. A person's knowledge can be discovered from his profile and his related historical behaviors, especially for researchers who are deeply specialized in one scientific domain. Thus, our major work in this paper is to find out how much a given annotator may know about a biomedical entity from his profile on the web and from frequent patterns of the entities that he annotated in the past. An entity can be semantically defined by its attributes and its related entities' attributes, and a person's knowledge about an entity can be reflected by his knowledge about those attributes. To express such a relation, we extend the RDF model by assigning a weight to each edge, which denotes the degree to which the root node (the annotator) knows about the target node (an entity or one of its attributes). The weight can be evaluated with the cooccurrence of the annotator and the target node in credible web information. Besides, an intent weight indicates that people who know concept c may also know c's related concepts.
The second way to discover how the annotator semantically relates to the biomedical entity is frequent pattern mining over historical annotations, which reveals the common features of biomedical entities that an annotator may know. The pattern mining algorithm proposed in this paper can deal with problems caused by a small example space, cold start, and improper division of the data source.
In the future, we will study how to link the records of a user and extract his profile information from the Internet when duplicate and uncertain data occur.

Conflict of Interests
The author declares that there is no conflict of interests regarding the publication of this paper.