Correlating Information Contents of Gene Ontology Terms to Infer Semantic Similarity of Gene Products

Successful applications of the gene ontology to the inference of functional relationships between gene products in recent years have raised the need for computational methods to automatically calculate semantic similarity between gene products based on semantic similarity of gene ontology terms. Nevertheless, existing methods, though having been widely used in a variety of applications, may significantly overestimate semantic similarity between genes that are actually not functionally related, thereby yielding misleading results in applications. To overcome this limitation, we propose to represent a gene product as a vector that is composed of information contents of gene ontology terms annotated for the gene product, and we suggest calculating similarity between two gene products as the relatedness of their corresponding vectors using three measures: Pearson's correlation coefficient, cosine similarity, and the Jaccard index. We focus on the biological process domain of the gene ontology and annotations of yeast proteins to study the effectiveness of the proposed measures. Results show that semantic similarity scores calculated using the proposed measures are more consistent with known biological knowledge than those derived using a list of existing methods, suggesting the effectiveness of our method in characterizing functional relationships between gene products.


Introduction
Over the last few years, domain ontologies have been successfully applied to describe entities within a variety of biological domains, with examples including the derivation of functional relationships between gene products based on the gene ontology (GO) [1][2][3], the inference of phenotype similarity between human diseases based on the human phenotype ontology (HPO) [4,5], the modeling of general computational tasks in systems biology based on the systems biology ontology (SBO) [6], and many others [7][8][9]. With an ontology to provide controlled and structured vocabularies in a specific biological domain and annotations to characterize entities in the domain with the vocabularies, relationships between the entities can be quantified by their semantic similarities in the ontology, thereby providing a convenient yet powerful means of profiling the entities and their semantic relationships [1]. Nevertheless, the automated derivation of semantic similarity between entities based on their annotations in a domain specific ontology remains a great challenge, calling for the development of effective and convenient computational methods [10].
In general, a domain ontology provides a set of controlled and relational vocabularies for describing domain specific knowledge. The vocabularies, also referred to as concepts or terms, are often organized as a directed acyclic graph (DAG), in which vertices denote terms and edges represent semantic relationships between the terms. It is also common that an ontology has more than one semantic relationship. For example, in the gene ontology, there are multiple types of semantic relationships such as "$u$ is a $v$" (any instance of $u$ is also an instance of $v$) and "$u$ part of $v$" (an instance of $u$ is a component of some instances of $v$) [1]. Given such a domain specific ontology and annotations that map entities onto the terms, most existing methods first calculate pairwise semantic similarity between the terms using the structure of the ontology and annotations of entities and then derive similarity between the entities based on similarity between the terms [10][11][12][13][14].
Taking the gene ontology as an example, in order to achieve the former objective, Resnik proposed to use the information content (the negative logarithm of the relative frequency of occurrence of a term in annotations for a set of gene products) of the lowest common ancestor of two query terms to measure their semantic similarity [11]. Lin modified this measure by taking information contents of the query terms into consideration [12]. Schlicker et al. further incorporated the relative frequency of occurrence of the lowest common ancestor into the measure of Lin [14]. Jiang and Conrath proposed to incorporate the information contents of the query terms by using a formula different from that of Lin [13]. As another branch, Wang et al. proposed to calculate semantic similarity between GO terms using only the structural information of the underlying gene ontology, with the consideration of two types of semantic relationships: is a and part of [10].
With similarities between GO terms calculated, the semantic similarity between two query gene products was often calculated using a mean-max rule [10]. More specifically, given a single GO term and a collection of GO terms, the similarity between the term and the collection was defined as the maximum similarity between the term and every term in the collection. Furthermore, the similarity between two collections of GO terms was defined as the average, over all terms in both collections, of the similarity between each term and the other collection. Finally, since a gene product was annotated by a collection of GO terms, semantic similarity between two gene products was defined as the similarity between the corresponding two sets of GO terms.
The above methods have been successfully applied to a variety of fields, with examples including the calculation of functional similarity between proteins based on the gene ontology (GO) for the inference of disease genes [2], the characterization of phenotype similarity between human diseases based on the human phenotype ontology (HPO) [5], and many others [7]. Software packages implementing these methods have also been released and made publicly available to the bioinformatics and computational biology community, with examples including GOSemSim [15], FuSSiMeG [16], and OWLSim [4]. However, these methods also have clear disadvantages. For example, although methods such as those in [12][13][14] sought to refine the method of Resnik [11], they often performed worse than that of Resnik in real applications [10], suggesting that the revision of information contents can hardly be effective. Also, although Wang et al. systematically considered the structure and multiple semantic relationships of the gene ontology [10], they discarded the valuable resource of information contents of GO terms, resulting in a method performing worse than that of Resnik in many applications such as the prioritization of candidate genes [2]. In addition, as we shall see in the Results section, all of these methods tend to overestimate similarity between proteins that are actually not similar in their functions, thereby yielding misleading results in applications.
With these understandings, we propose in this paper to represent a gene product using a vector that is composed of information contents of GO terms annotated for the product in the gene ontology. Based on this notion, we suggest calculating semantic similarity between gene products as the relatedness of their corresponding vectors using three measures: Pearson's correlation coefficient, cosine similarity, and the Jaccard index. We focus on the biological process namespace of the gene ontology and annotations of proteins of the budding yeast Saccharomyces cerevisiae to perform a series of comprehensive studies on the effectiveness of the proposed measures. We calculate semantic similarity scores between yeast genes relying on the biological process domain of the gene ontology, use the resulting semantic similarity scores to measure functional relationships between the proteins, and study the consistency between such relationships and known biological knowledge. Results on 141 yeast biochemical pathways, 1,022 protein families, and two large-scale yeast protein-protein interaction networks show that semantic similarity scores calculated using the proposed measures are more consistent with biological knowledge than those derived using a list of existing methods, suggesting the effectiveness of our method in characterizing semantic similarity between gene products.

The Gene Ontology and Species Specific Annotations.
The gene ontology (GO) provides a controlled vocabulary of terms for describing characteristics of gene products. This ontology covers three domains: biological process (BP), molecular function (MF), and cellular component (CC). The biological process domain defines operations or sets of molecular events with a defined beginning and end, pertinent to the functioning of living cells, tissues, organs, and organisms. The molecular function domain represents the elemental activities of a gene product at the molecular level, such as binding or catalysis. The cellular component domain describes the parts of a cell or its extracellular environment [1]. Each of these three domains is organized according to a directed acyclic graph (DAG) structure, represented as $G = (V, E)$, where $V$ is a set of vertices denoting concepts and $E$ is a set of edges denoting semantic relationships between the terms. In such a graph, we use $P_t$ and $C_t$ to denote the sets of parents and children of term $t$, including $t$ itself, respectively, and we use $A_t$ and $D_t$ to denote the sets of ancestors and descendants of term $t$, including $t$ itself, respectively. Note that in the gene ontology, there are multiple types of semantic relationships such as "$u$ is a $v$" (any instance of $u$ is also an instance of $v$) and "$u$ part of $v$" (an instance of $u$ is a component of some instance of $v$).
A species-specific annotation provides a mapping from a gene product of the species to a term in a domain (BP, MF, or CC) of the gene ontology. Following common specifications, the annotation of a gene product with term $t$ implies the annotation of the gene product with all ancestors of $t$. With this notion, we represent annotations of gene product $g$ using a binary annotation vector $\mathbf{a}_g = (a_{gi})_{|V| \times 1}$, where $a_{gi} = 1$ if $g$ is annotated by the term indexed by $i$ or its descendants and $a_{gi} = 0$ otherwise, and $|V|$ is the total number of terms in a domain.
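The propagation of annotations to ancestors and the resulting binary annotation vector can be illustrated with a minimal sketch. The toy ontology, the term names, and the helper functions `ancestors` and `annotation_vector` are hypothetical and serve only as an illustration; they are not taken from the paper.

```python
# Hypothetical toy ontology: each term maps to the list of its parents.
PARENTS = {
    "root": [],
    "metabolism": ["root"],
    "transport": ["root"],
    "glycolysis": ["metabolism"],
}


def ancestors(term, parents):
    """Return the set containing `term` and all of its ancestors."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def annotation_vector(direct_terms, parents, term_index):
    """Binary vector with 1 for each directly annotated term and its ancestors."""
    vector = [0] * len(term_index)
    for term in direct_terms:
        for t in ancestors(term, parents):
            vector[term_index[t]] = 1
    return vector


# Fix an (arbitrary) ordering of terms so that vectors are comparable.
TERM_INDEX = {term: i for i, term in enumerate(sorted(PARENTS))}

# A gene product annotated directly with "glycolysis" only.
a_g = annotation_vector(["glycolysis"], PARENTS, TERM_INDEX)
print(a_g)  # entries ordered as: glycolysis, metabolism, root, transport
```

In this sketch the gene product inherits the entries for "metabolism" and "root" although it is annotated directly only with "glycolysis".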

Semantic Similarity as Correlation of Information Contents
Given a domain of the gene ontology and annotations for a set of gene products, the probability that a gene product is annotated by term $t$ or its descendants is estimated by the relative frequency of occurrence of term $t$ and its descendants in the annotations:

$$p_t = \frac{n_t}{N},$$

where $n_t$ is the number of annotations with term $t$ and $N$ is the total number of annotations. The information content of term $t$ is then calculated as

$$q_t = -\log p_t.$$

Moreover, information contents of all terms in the domain can be represented as a vector $\mathbf{q} = (q_i)_{|V| \times 1}$, with $q_i$ being the information content of the term indexed by $i$ and $|V|$ the total number of terms in the domain. Calculating the Hadamard (entrywise) product of the binary annotation vector $\mathbf{a}_g$ of gene product $g$ and $\mathbf{q}$, we obtain the vector of information contents for gene product $g$ as

$$\mathbf{x}_g = \mathbf{a}_g \circ \mathbf{q}.$$

With such a vector calculated for every gene product, we propose the following three measures to quantify semantic similarity between two entities.
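As a rough illustration of this construction, the following sketch estimates information contents from a handful of annotations and forms the vector of information contents for one gene product. The toy ontology, the annotation list, and the function names are hypothetical; counting every annotation once for the annotated term and once for each of its ancestors is an assumption about how the relative frequencies are obtained.

```python
import math
from collections import Counter

# Hypothetical toy ontology (term -> parents) and direct annotations.
PARENTS = {
    "root": [],
    "metabolism": ["root"],
    "transport": ["root"],
    "glycolysis": ["metabolism"],
}
ANNOTATIONS = [
    ("g1", "glycolysis"),
    ("g1", "transport"),
    ("g2", "metabolism"),
    ("g3", "transport"),
]


def ancestors(term, parents):
    """Return the set containing `term` and all of its ancestors."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def information_contents(annotations, parents):
    """Map each term to log(N / n_t), i.e. minus the log of its relative frequency."""
    counts = Counter()
    for _gene, term in annotations:
        for t in ancestors(term, parents):
            counts[t] += 1  # an annotation also counts for every ancestor
    total = len(annotations)
    # Terms that never occur get 0.0 here (assumption); no gene product is
    # annotated with them, so they never contribute to an IC vector.
    return {t: (math.log(total / counts[t]) if counts[t] else 0.0) for t in parents}


def ic_vector(gene, annotations, parents, terms, ic):
    """Entrywise product of the binary annotation vector and the IC vector."""
    annotated = set()
    for g, term in annotations:
        if g == gene:
            annotated |= ancestors(term, parents)
    a = [1 if t in annotated else 0 for t in terms]
    q = [ic[t] for t in terms]
    return [a_i * q_i for a_i, q_i in zip(a, q)]


TERMS = sorted(PARENTS)  # glycolysis, metabolism, root, transport
IC = information_contents(ANNOTATIONS, PARENTS)
x_g2 = ic_vector("g2", ANNOTATIONS, PARENTS, TERMS, IC)
print(IC)
print(x_g2)
```

The root term occurs in every annotation after propagation, so its information content is zero and it contributes nothing to any vector.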
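First, we propose to calculate the similarity as the absolute value of Pearson's correlation coefficient between the two vectors $\mathbf{x}_g$ and $\mathbf{x}_h$ for two gene products $g$ and $h$ as

$$\operatorname{sim}_{\text{cor}}(g, h) = \left| \frac{\sum_{i} (x_{gi} - \bar{x}_g)(x_{hi} - \bar{x}_h)}{\sqrt{\sum_{i} (x_{gi} - \bar{x}_g)^2} \sqrt{\sum_{i} (x_{hi} - \bar{x}_h)^2}} \right|,$$

where $\bar{x}_g$ and $\bar{x}_h$ are the means of the entries of $\mathbf{x}_g$ and $\mathbf{x}_h$. In this measure, we assume that information contents for the two gene products, $\mathbf{x}_g$ and $\mathbf{x}_h$, have a linear relationship, say,

$$x_{hi} = \alpha + \beta x_{gi} + \varepsilon_i.$$

Hence, it is natural to use the coefficient of determination ($R^2$), which measures how well the observations fit this linear model, to quantify the similarity between the two vectors. To ease the computation, we simply calculate the absolute value of the correlation coefficient instead of $R^2$. Note that exchanging $\mathbf{x}_g$ and $\mathbf{x}_h$ in the linear model yields the same $R^2$.

Second, we calculate the similarity as the cosine of the angle between the two vectors $\mathbf{x}_g$ and $\mathbf{x}_h$ for two gene products $g$ and $h$ as

$$\operatorname{sim}_{\cos}(g, h) = \frac{\sum_{i} x_{gi} x_{hi}}{\sqrt{\sum_{i} x_{gi}^2} \sqrt{\sum_{i} x_{hi}^2}}.$$

This is equivalent to calculating the uncentered correlation coefficient of the two vectors. It is evident that the cosine measure will yield results similar to those of the correlation measure when the means of $\mathbf{x}_g$ and $\mathbf{x}_h$ are small.

Third, we calculate the similarity as the Jaccard index of the two binary annotation vectors $\mathbf{a}_g$ and $\mathbf{a}_h$ for two gene products $g$ and $h$ as

$$\operatorname{sim}_{\text{Jac}}(g, h) = \frac{\sum_{i} a_{gi} a_{hi}}{\sum_{i} (a_{gi} + a_{hi} - a_{gi} a_{hi})}.$$

This is equivalent to calculating the ratio of the number of elements in the intersection to that in the union of the two annotation sets for gene products $g$ and $h$.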
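The three measures can be sketched in a few lines of plain Python. The function names and the example vectors are hypothetical, and returning 0.0 for a constant or all-zero vector is an assumption made here to avoid division by zero; the paper does not state how such cases are handled.

```python
import math


def pearson_similarity(x, y):
    """Absolute value of Pearson's correlation coefficient of two vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: undefined correlation is treated as 0
    return abs(sxy / math.sqrt(sxx * syy))


def cosine_similarity(x, y):
    """Cosine of the angle between two vectors (uncentered correlation)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_x * norm_y)


def jaccard_similarity(a, b):
    """Size of the intersection over size of the union of two binary vectors."""
    intersection = sum(1 for p, q in zip(a, b) if p and q)
    union = sum(1 for p, q in zip(a, b) if p or q)
    return intersection / union if union else 0.0


# Hypothetical vectors of information contents over four terms.
x_g = [1.0, 2.0, 0.0, 0.0]
x_h = [2.0, 4.0, 0.0, 0.0]
print(pearson_similarity(x_g, x_h), cosine_similarity(x_g, x_h))

# Hypothetical binary annotation vectors sharing two of four terms.
print(jaccard_similarity([1, 1, 1, 0], [0, 1, 1, 1]))
```

Because each measure is a single pass over two vectors, the cost per pair of gene products is linear in the number of terms.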

Existing Methods for Calculating Semantic Similarity
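Most existing methods first derive similarity scores between terms and then calculate semantic similarity scores between gene products as similarity scores between collections of annotated terms for the products. More precisely, there have been two main categories of methods for calculating pairwise concept similarity scores: (1) approaches based on information contents of terms in the gene ontology and (2) methods based on the structure of the gene ontology. The first group of approaches calculates similarity between two terms $u$ and $v$ relying on the information content of the most specific term $t_{uv}$ among their common ancestors. Generally, a term with a more specific meaning tends to have a higher information content, and hence

$$q_{t_{uv}} = \max_{t \in A_u \cap A_v} q_t,$$

where $A_u$ denotes the set of ancestors of term $u$ (including $u$ itself) and $q_t$ the information content of term $t$. With this notion, Resnik [11] defined the similarity between $u$ and $v$ as

$$\operatorname{sim}_{\text{Resnik}}(u, v) = q_{t_{uv}}.$$

Lin [12] defined the similarity as

$$\operatorname{sim}_{\text{Lin}}(u, v) = \frac{2 q_{t_{uv}}}{q_u + q_v}.$$

Schlicker et al. [14] defined the similarity as

$$\operatorname{sim}_{\text{Schlicker}}(u, v) = \frac{2 q_{t_{uv}}}{q_u + q_v} \left(1 - p_{t_{uv}}\right),$$

where $p_{t_{uv}}$ is the relative frequency of occurrence of $t_{uv}$. Jiang and Conrath [13] defined the dissimilarity between two terms as

$$d_{\text{JC}}(u, v) = q_u + q_v - 2 q_{t_{uv}}.$$

This is equivalent to defining its reciprocal as the similarity.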
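A minimal sketch of these information-content-based term measures follows. The toy ontology and the relative frequencies are hypothetical, and the function names are chosen here for illustration. The Jiang and Conrath measure is given only as a dissimilarity, since the exact form of its conversion into a similarity is not reproduced; returning 0.0 when both terms have zero information content is likewise an assumption.

```python
import math

# Hypothetical toy ontology: each term maps to the list of its parents.
PARENTS = {
    "root": [],
    "metabolism": ["root"],
    "transport": ["root"],
    "glycolysis": ["metabolism"],
    "gluconeogenesis": ["metabolism"],
}
# Hypothetical relative frequencies of each term (including descendants).
P = {
    "root": 1.0,
    "metabolism": 0.5,
    "transport": 0.5,
    "glycolysis": 0.25,
    "gluconeogenesis": 0.25,
}
IC = {term: -math.log(p) for term, p in P.items()}


def ancestors(term):
    """Return the set containing `term` and all of its ancestors."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def mica(u, v):
    """Most informative (most specific) common ancestor of two terms."""
    return max(ancestors(u) & ancestors(v), key=IC.get)


def resnik(u, v):
    return IC[mica(u, v)]


def lin(u, v):
    denominator = IC[u] + IC[v]
    # Assumption: the score is set to 0 when both terms carry no information.
    return 2 * IC[mica(u, v)] / denominator if denominator else 0.0


def schlicker(u, v):
    return lin(u, v) * (1 - P[mica(u, v)])


def jiang_conrath_distance(u, v):
    return IC[u] + IC[v] - 2 * IC[mica(u, v)]


pair = ("glycolysis", "gluconeogenesis")
print(mica(*pair), resnik(*pair), lin(*pair), schlicker(*pair))
print(jiang_conrath_distance(*pair))
```

For the two sibling terms in this toy example, the most informative common ancestor is their shared parent, and all four scores are derived from its information content.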

Computational and Mathematical Methods in Medicine
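The second group of approaches calculates similarity between GO terms depending on the structure of the gene ontology. Briefly, given a term indexed by $u$, Wang et al. iteratively calculate an $S$-value for every ancestor $t \in A_u$ to measure the contribution of $t$ to the semantics of $u$ as

$$S_u(u) = 1, \qquad S_u(t) = \max \left\{ w_{t't} \, S_u(t') : t' \in C_t \cap A_u,\ t' \neq t \right\} \quad \text{for } t \neq u,$$

where $A_u$ is the set of ancestors of $u$ (including $u$ itself), $C_t$ is the set of children of $t$, and the weight $w_{t't} = 0.8$ if $t'$ and $t$ have the is a relationship and $w_{t't} = 0.6$ if $t'$ and $t$ have the part of relationship [10]. Then, a semantic value for term $u$ is defined as $SV(u) = \sum_{t \in A_u} S_u(t)$. Finally, the semantic similarity score between two terms $u$ and $v$ is defined as

$$\operatorname{sim}_{\text{Wang}}(u, v) = \frac{\sum_{t \in A_u \cap A_v} \left( S_u(t) + S_v(t) \right)}{SV(u) + SV(v)}.$$

With pairwise semantic similarity scores between GO terms being ready, the similarity between term $u$ and a set of terms $T$ is defined as

$$\operatorname{sim}(u, T) = \max_{v \in T} \operatorname{sim}(u, v),$$

where $\operatorname{sim}(u, v)$ is calculated using any of the above methods. The similarity between two sets of terms $T_1$ and $T_2$ can then be calculated as

$$\operatorname{sim}(T_1, T_2) = \frac{\sum_{u \in T_1} \operatorname{sim}(u, T_2) + \sum_{v \in T_2} \operatorname{sim}(v, T_1)}{|T_1| + |T_2|}.$$

Finally, for two gene products $g$ and $h$ annotated by two sets of terms $T_g$ and $T_h$, respectively, the semantic similarity between the two objects is then defined as

$$\operatorname{sim}(g, h) = \operatorname{sim}(T_g, T_h).$$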
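The structure-based term similarity and the rule for combining term scores into a score for two sets of terms can be sketched as follows. The toy ontology and all function names are hypothetical; the weights 0.8 and 0.6 are those stated in the text, and the S-values are computed by repeatedly relaxing edges from the query term upward, which is one possible way to realize the iterative calculation.

```python
# Weights of the two relationship types, as given in the text.
WEIGHTS = {"is_a": 0.8, "part_of": 0.6}

# Hypothetical toy ontology: term -> list of (parent, relationship type).
EDGES = {
    "root": [],
    "metabolism": [("root", "is_a")],
    "transport": [("root", "is_a")],
    "glycolysis": [("metabolism", "is_a")],
    "gluconeogenesis": [("metabolism", "is_a")],
}


def s_values(term):
    """S-values of `term` and all of its ancestors with respect to `term`."""
    s = {term: 1.0}
    pending = [term]
    while pending:
        child = pending.pop()
        for parent, relation in EDGES[child]:
            candidate = WEIGHTS[relation] * s[child]
            if candidate > s.get(parent, 0.0):
                s[parent] = candidate  # keep the largest contribution
                pending.append(parent)  # propagate the improved value upward
    return s


def wang_similarity(u, v):
    """Shared S-values of common ancestors over the two semantic values."""
    s_u, s_v = s_values(u), s_values(v)
    common = s_u.keys() & s_v.keys()
    shared = sum(s_u[t] + s_v[t] for t in common)
    return shared / (sum(s_u.values()) + sum(s_v.values()))


def term_to_set(u, terms, sim):
    """Similarity between one term and a set: the best match in the set."""
    return max(sim(u, v) for v in terms)


def set_to_set(terms_1, terms_2, sim):
    """Average of best-match scores taken in both directions."""
    total = sum(term_to_set(u, terms_2, sim) for u in terms_1)
    total += sum(term_to_set(v, terms_1, sim) for v in terms_2)
    return total / (len(terms_1) + len(terms_2))


# Hypothetical direct annotations of two gene products.
T_g = ["glycolysis"]
T_h = ["gluconeogenesis", "transport"]
print(wang_similarity("glycolysis", "gluconeogenesis"))
print(set_to_set(T_g, T_h, wang_similarity))
```

Any of the term-level measures can be passed as `sim`, so the same combination rule serves both the information-content-based and the structure-based methods.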

Data Sources.
There have been quite a few domain specific ontologies available for characterizing entities in a variety of biological domains. Particularly, the OBO (open biological and biomedical ontologies) Foundry has released eight ontologies to provide standard descriptions of entities in biological domains [14]. Among these ontologies, biological process (BP), molecular function (MF), and cellular component (CC) are typically referred to as the gene ontology (GO), which has been widely used to describe functions of genes. The gene ontology also provides annotations of gene products for several well-studied model organisms, including yeast, fruit fly, and mouse [1]. In this paper, we focus on the biological process domain of GO and annotations of the budding yeast Saccharomyces cerevisiae to validate the effectiveness of the proposed measures. We extract 22,688 terms from the biological process domain of the gene ontology (released on April 27, 2012) and obtain 22,798 annotations of 6,383 yeast genes (released on April 28, 2012).

Distribution of Semantic Similarity Scores of Random Gene Pairs.
A pair of genes selected at random is unlikely to have similar functions, and thus the semantic similarity score between such a pair of genes should be close to zero. To validate this argument, we calculate semantic similarity scores of 100,000 pairs of yeast genes selected at random, and we summarize the distribution of the scores in Figure 1. We can clearly see from the figure that the median similarity score of the correlation measure (0.004894) is close to 0, as is that of the cosine measure (0.003196). The median similarity score of the Jaccard measure (0.03846) is higher than those for both the correlation and the cosine measures but still lower than those for all the five existing methods.
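The sampling procedure behind this comparison can be sketched as follows, assuming a similarity function is already available. The toy annotation sets, the Jaccard helper, and the function `random_pair_scores` are hypothetical stand-ins; with real data, any of the measures would be plugged in as `sim` and the number of pairs raised accordingly.

```python
import random
from statistics import median

# Hypothetical annotation sets of six genes (already propagated to ancestors).
TERMS = {
    "g1": {"root", "a", "b"},
    "g2": {"root", "a"},
    "g3": {"root", "c"},
    "g4": {"root", "d"},
    "g5": {"root", "c", "e"},
    "g6": {"root", "f"},
}


def jaccard(g, h):
    """Jaccard index of the annotation sets of two genes."""
    a, b = TERMS[g], TERMS[h]
    return len(a & b) / len(a | b)


def random_pair_scores(genes, sim, n_pairs, seed=0):
    """Similarity scores of `n_pairs` pairs of distinct genes drawn at random."""
    rng = random.Random(seed)  # fixed seed so that the run is reproducible
    return [sim(*rng.sample(genes, 2)) for _ in range(n_pairs)]


scores = random_pair_scores(sorted(TERMS), jaccard, n_pairs=1000)
print(median(scores))
```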

Consistency between Gene Semantic Similarity and Pathway Data.
It is known that most biological functions arise from collaborative effects of several proteins that are usually involved in the same biological process and form a pathway [17]. Hence, gene products (proteins) in the same pathway should have similar annotations in the biological process ontology and in turn have high semantic similarity scores according to this ontology. On the contrary, gene products belonging to different pathways should have relatively low semantic similarity scores. To assess whether the proposed similarity measures are consistent with this knowledge, we compare semantic similarity scores between proteins within a pathway and those between proteins involved in different pathways as follows. We download from the Saccharomyces Genome Database (SGD) [18] 141 pathways, each including at least two proteins. For each of these pathways, we calculate pairwise semantic similarity scores of proteins involved in the pathway, and we average these scores over all pairs of proteins to obtain the mean semantic similarity score within the pathway ($\mu_{\text{in}}$). Meanwhile, for each pathway, we further select at random 10 times the number of proteins as those in the pathway, calculate semantic similarity scores between these proteins and those in the pathway, and average over these scores to obtain the mean semantic similarity score outside the pathway ($\mu_{\text{out}}$). Then, we plot the distribution of mean similarity scores within and outside all pathways in Figure 2. From the figure, we observe that the mean similarity scores within pathways are in general large, while those outside pathways are typically small. Particularly, for all of the three proposed measures (correlation, cosine, and the Jaccard), the differences between the medians of the mean similarity scores within and outside pathways are much more obvious than those of the five existing methods. For example, using the correlation measure, we obtain the median $\mu_{\text{in}}$ over all pathways as 0.6578 and the median $\mu_{\text{out}}$ as 0.02564. Using the cosine measure, we obtain a median $\mu_{\text{in}}$ of 0.6600 and a median $\mu_{\text{out}}$ of 0.02733. In contrast, the method of Wang produces a median $\mu_{\text{in}}$ of 0.7405 and a median $\mu_{\text{out}}$ of 0.2489, and the method of Resnik produces a median $\mu_{\text{in}}$ of 0.4662 and a median $\mu_{\text{out}}$ of 0.09956. We further calculate for each pathway the ratio of the mean semantic similarity score within the pathway over that outside the pathway ($\mu_{\text{in}} / \mu_{\text{out}}$), and we average such ratios over all 141 pathways to obtain a criterion called the fold change of semantic similarity scores within pathways against those outside pathways. We summarize the fold changes in Figure 3, from which we can clearly see the effectiveness of the proposed measures. For example, using the correlation measure, we obtain a fold change of 29.93. Using the cosine measure, we obtain a fold change of 26.65. In contrast, the method of Wang only produces a fold change of 3.03. These observations support the conclusion that the proposed measures yield much more reasonable results in assessing functional relationships between proteins within pathways, and thus these measures are more consistent with biological knowledge than existing methods.
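The fold change criterion can be sketched as below for an arbitrary similarity function. The toy annotation sets, the single toy pathway, and the function names are hypothetical. Drawing the random proteins only from proteins outside the pathway, and capping the sample at the number of available proteins, are assumptions of this sketch; the text does not specify either detail.

```python
import random
from itertools import combinations
from statistics import mean

# Hypothetical annotation sets (already propagated to ancestors).
TERMS = {
    "g1": {"root", "a", "b"},
    "g2": {"root", "a", "b"},
    "g3": {"root", "a"},
    "g4": {"root", "c"},
    "g5": {"root", "d"},
    "g6": {"root", "c", "d"},
}


def jaccard(g, h):
    a, b = TERMS[g], TERMS[h]
    return len(a & b) / len(a | b)


def mean_within(members, sim):
    """Mean similarity over all pairs of proteins inside one group."""
    return mean(sim(g, h) for g, h in combinations(members, 2))


def mean_outside(members, all_genes, sim, rng, factor=10):
    """Mean similarity between group members and randomly drawn outsiders."""
    outsiders = [g for g in all_genes if g not in members]
    size = min(factor * len(members), len(outsiders))  # cap for small toy data
    sample = rng.sample(outsiders, size)
    return mean(sim(g, h) for g in members for h in sample)


def fold_change(groups, all_genes, sim, seed=0):
    """Average over groups of the within-to-outside ratio of mean similarity."""
    rng = random.Random(seed)
    ratios = [
        mean_within(m, sim) / mean_outside(m, all_genes, sim, rng) for m in groups
    ]
    return mean(ratios)


PATHWAYS = [["g1", "g2", "g3"]]  # one hypothetical pathway
print(fold_change(PATHWAYS, sorted(TERMS), jaccard))
```

The same routine applies unchanged to protein families, since only the grouping of proteins differs.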

Consistency between Gene Semantic Similarity and Protein Domain Data.
Proteins are often composed of one or more functional regions, commonly referred to as protein domains [19]. Different domains typically account for different functions of proteins containing them, and thus different combinations of protein domains give rise to the diverse range of proteins found in nature. Hence, proteins can be classified into different families according to the domains that the proteins contain. Moreover, proteins containing the same domain, that is, belonging to the same family, should have some similar functions and thus share some similar annotations in the biological process domain of the gene ontology. Consequently, proteins belonging to the same family should have high semantic similarity scores according to the gene ontology. On the contrary, proteins belonging to different families should have relatively low semantic similarity scores. To assess whether the proposed similarity measures are consistent with this knowledge, we compare semantic similarity scores between proteins within a protein family and those between proteins belonging to different families as follows.
The Pfam database [20] provides a large collection of both high quality protein families (Pfam-A) and low quality protein families (Pfam-B). In version 26.0 of the Pfam-A collection (released in November 2011), 13,672 protein families are collected. From this data source, we extract 1,022 protein families, each including at least two yeast proteins. For each of these families, we calculate pairwise semantic similarity scores of proteins belonging to the family, and we average these scores over all pairs of proteins to obtain the mean semantic similarity score within the family ($\nu_{\text{in}}$). Meanwhile, for each protein family, we further select at random 10 times the number of proteins as those in the family, calculate semantic similarity scores between these proteins and those belonging to the family, and average over these scores to obtain the mean semantic similarity score outside the family ($\nu_{\text{out}}$). Then, we calculate for each protein family the ratio of the mean semantic similarity score within the family over that outside the family ($\nu_{\text{in}} / \nu_{\text{out}}$), and we average such ratios over all 1,022 protein families to obtain a criterion called the fold change of semantic similarity scores within protein families against those outside families. We summarize the fold changes in Figure 4, from which we can clearly see the effectiveness of the proposed measures. For example, using the correlation measure, we obtain a fold change of 6.915. Using the cosine measure, we obtain a fold change of 6.511. Using the Jaccard measure, we obtain a fold change of 3.267. In contrast, the method of Wang only produces a fold change of 1.856, and the method of Resnik produces a slightly larger fold change of 2.370.
We further change the minimum number of proteins belonging to a protein family from 2 to 10, calculate the fold change in each situation, and present the results in Table 1. Briefly, the fold change varies with the minimum number of proteins in a protein family, but the observation that the fold changes of the proposed measures are greater than those of the existing methods remains unchanged. For example, when considering protein families containing at least 10 proteins, we obtain fold changes of 9.273, 9.814, and 4.516 for the correlation, cosine, and the Jaccard measures, respectively. In contrast, the fold changes for the measures of Wang, Resnik, and Schlicker are 2.090, 2.846, and 3.430, respectively. These results suggest that the proposed measures yield much more reasonable results in assessing functional relationships between proteins that belong to the same protein family. Hence, we conclude that the proposed measures are more consistent with biological knowledge than existing methods.

Consistency between Gene Semantic Similarity and PPI Data.
Biological knowledge suggests that proteins often interact with each other in the collaborative generation of biological functions [21]. The collection of all physical interactions in a living organism is typically referred to as the protein-protein interaction (PPI) network, in which nodes are proteins and edges are physical interactions between the proteins. Interacting proteins are usually involved in similar biological processes and thus have similar annotations in the biological process domain of the gene ontology and high semantic similarity scores. To assess whether our similarity measures are consistent with this knowledge, we assess relationships between interacting proteins and their semantic similarity scores as follows.
We download two manually curated PPI networks of Saccharomyces cerevisiae. From BioGrid (biological generic repository for interaction datasets) [22,23], we extract a PPI network composed of 3,529 nodes and 16,285 edges. From DIP (database of interacting proteins) [24,25], we extract a relatively small PPI network including 2,902 nodes and 7,005 edges. For each of these networks, we calculate semantic similarity scores for interacting proteins and those for the same number of randomly selected noninteracting pairs of proteins, and we plot the distribution of these scores in Figures 5(a) and 5(b). From the figures, we can see that the semantic similarity scores for interacting proteins are in general larger than those for noninteracting proteins, and this observation holds for both the BioGrid and the DIP networks.
Then, for each of these networks, we average over semantic similarity scores between interacting proteins to obtain the mean semantic similarity score of interacting proteins ($\mu_{\text{int}}$). Meanwhile, we average over semantic similarity scores of noninteracting pairs of proteins to obtain the mean semantic similarity score of noninteracting proteins ($\mu_{\text{non}}$). Finally, we calculate the fold change as $\mu_{\text{int}} / \mu_{\text{non}}$ to measure the effectiveness of a method in distinguishing the functional relationship between interacting proteins. We summarize the results in Figure 6, from which we can see the effectiveness of the proposed measures. For example, for the BioGrid network, we obtain a fold change of 6.15 when using the correlation measure. For the DIP network, the fold change is 5.44 for the correlation measure. For the cosine and the Jaccard measures, we observe similar results. These observations suggest that the semantic similarity scores calculated by the proposed measures are consistent with biological knowledge about interacting proteins.
It has also been shown that proteins closer in a PPI network tend to have more similar functions [4]. With this understanding, we use the length of the shortest path between two proteins in a PPI network to measure the network proximity of the proteins, use the semantic similarity score of the two proteins to measure their functional similarity, and plot the change of the similarity score with the closeness of proteins in Figure 6. From the figure, we can see that protein pairs tend to have higher semantic similarity scores if they are closer in the PPI network. For example, for the BioGrid network and the cosine measure, the median semantic similarity score is 0.2590 for directly interacting protein pairs, 0.0720 for protein pairs connected through one intermediate protein, 0.0372 for protein pairs connected through two intermediate proteins, and so forth. Similar results are observed for the other two measures. These results suggest that protein similarity scores are correlated with protein closeness in a PPI network, again consistent with biological knowledge.
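Grouping similarity scores by network distance can be sketched with a breadth-first search over an unweighted network. The four-protein toy network, the table of scores, and the function names below are hypothetical and only illustrate the bookkeeping; they do not reproduce the reported values.

```python
from collections import defaultdict, deque
from statistics import median

# Hypothetical PPI network given as an adjacency list (a simple chain).
NETWORK = {
    "p1": ["p2"],
    "p2": ["p1", "p3"],
    "p3": ["p2", "p4"],
    "p4": ["p3"],
}
# Hypothetical semantic similarity scores of all protein pairs.
SCORES = {
    frozenset(("p1", "p2")): 0.6,
    frozenset(("p2", "p3")): 0.5,
    frozenset(("p3", "p4")): 0.7,
    frozenset(("p1", "p3")): 0.2,
    frozenset(("p2", "p4")): 0.3,
    frozenset(("p1", "p4")): 0.05,
}


def similarity(g, h):
    return SCORES[frozenset((g, h))]


def shortest_path_lengths(network, source):
    """Breadth-first search: number of edges from `source` to reachable nodes."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in network[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


def median_similarity_by_distance(network, sim):
    """Median similarity of protein pairs, grouped by shortest path length."""
    by_distance = defaultdict(list)
    for source in network:
        for target, d in shortest_path_lengths(network, source).items():
            if source < target:  # visit each unordered pair exactly once
                by_distance[d].append(sim(source, target))
    return {d: median(values) for d, values in sorted(by_distance.items())}


print(median_similarity_by_distance(NETWORK, similarity))
```

A distance of 1 corresponds to directly interacting pairs, a distance of 2 to pairs connected through one intermediate protein, and so on.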

Conclusions and Discussion
In this paper, we have proposed an approach to represent annotations of a gene product in the gene ontology using vectors that are composed of information contents of terms in the ontology. Based on this notion, we have proposed to calculate pairwise semantic similarity between gene products by using three measures (Pearson's correlation coefficient, cosine similarity, and the Jaccard index) to quantify the relatedness of the corresponding vectors. We have performed a series of comprehensive studies on the effectiveness of the proposed measures using the ontology of biological process and annotations of the budding yeast Saccharomyces cerevisiae. Studies on the relationships between semantic similarity of gene products and biochemical pathways, protein families, and protein-protein interaction networks show that semantic similarity scores calculated using the proposed measures are more consistent with biological knowledge than those derived using a list of five existing methods, suggesting the effectiveness of our method in characterizing functional similarity between gene products based on the gene ontology. The main advantages of the proposed measures are their simplicity in calculation and their effectiveness in characterizing semantic similarity between gene products. The representation of gene products as vectors of information contents of ontology terms is straightforward, making the subsequent computation easy to understand. The simplicity of the representation also keeps the time complexity of the computation low, thereby making our method suitable for large scale calculation of semantic similarity for not only applications based on the gene ontology but also those using other ontologies.
Certainly, the proposed measures can be further improved in the following respects. First, although the contribution of a term in a domain ontology has been characterized by its information content, it is possible to further refine such contribution by adjusting the information contents with prior knowledge. For example, it is not hard to combine annotations of different organisms to achieve a more precise estimation of information contents for concepts in the gene ontology. Another possibility is to develop a Bayesian method to estimate the information contents, using existing annotations to derive the prior distribution.
Second, although the presentation of domain entities as vectors of concepts is simple yet effective, the incorporation of the structure of the concepts in the underlying ontology may further improve the performance of the proposed method. Existing algorithms for calculating similarity between two tree structures [26] might be a potential candidate along this direction.

Conflict of Interests
The author does not have any conflict of interests.