Integrating ontologies builds knowledge structures that bring new understanding of existing terminologies and their associations. With the steady increase in the number of ontologies, automatic integration is preferable to manual solutions in many applications. However, available works on ontology integration are largely heuristic and offer no guarantees on the quality of the integration results. In this work, we focus on the integration of ontologies with hierarchical structures. We identify optimal structures in this problem and propose optimal and efficient approximation algorithms for integrating a pair of ontologies. Furthermore, we extend the basic problem to address the integration of a large number of ontologies, and we propose an efficient approximation algorithm for integrating multiple ontologies. An empirical study on both real ontologies and synthetic data demonstrates the effectiveness of the proposed approaches. In addition, the results of integrating the Gene Ontology with the National Drug File Reference Terminology suggest that our method provides a novel way to perform association studies between biomedical terms.
In recent years, ontologies have become increasingly important in knowledge engineering. Generally speaking, an ontology is a collection of concepts and their relations. Ontologies have wide applications in computer science and life science. For example, in computer science, the Semantic Web uses the Web Ontology Language (OWL) to represent knowledge bases [
Although ontologies can be modeled as directed graphs, many ontologies are in fact hierarchical trees or have tree-like hierarchical structures. On the BioPortal website, users can find basic hierarchical properties of an ontology, such as the maximum depth and the maximum number of children. In the UMLS, the hierarchical structure of an ontology is documented in the “MRHIER.RRF” file, with each line being a path from a term to its root. We can build a hierarchical tree from these paths by merging the common nodes starting from the root. Because the hierarchical structures of some ontologies are in fact directed acyclic graphs, the resulting tree may contain duplicated concepts. To simplify our study, we treat each duplicate as an independent concept in this work.
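The path-merging step described above can be sketched as a prefix tree. This is a minimal illustration, not the authors' implementation: it assumes the paths have already been parsed into root-first lists of concept identifiers, and the names `Node`, `build_tree`, and `count_nodes` are hypothetical. A concept reached through two different parents ends up as two separate nodes, which matches the choice to treat duplicated concepts as independent.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One occurrence of a concept in the hierarchical tree."""
    concept: str
    children: dict = field(default_factory=dict)  # child concept -> Node


def build_tree(paths):
    """Merge root-first paths into one tree by sharing common prefixes.

    Assumption: every path is a non-empty list that starts at the same root.
    """
    if not paths or not paths[0]:
        raise ValueError("at least one non-empty path is required")
    root = Node(paths[0][0])
    for path in paths:
        if not path or path[0] != root.concept:
            raise ValueError("every path must start at the same root")
        node = root
        for concept in path[1:]:
            # Reuse the child if this prefix was seen before, else create it.
            if concept not in node.children:
                node.children[concept] = Node(concept)
            node = node.children[concept]
    return root


def count_nodes(node):
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(count_nodes(child) for child in node.children.values())


if __name__ == "__main__":
    # Concept "c" has two parents ("a" and "b"), so it appears twice.
    example = [["root", "a", "c"], ["root", "b", "c"], ["root", "a"]]
    print(count_nodes(build_tree(example)))
```

Keying children by concept identifier makes the merge of a shared prefix a dictionary lookup, so the tree is built in time linear in the total length of the paths.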
An important knowledge discovery task is to identify knowledge associations. In life science, this task includes finding the associations between diseases and genes [
Early studies on ontology integration relied on domain experts to manually set up the integration rules [
The basic ontology integration problem in our work can be formulated as follows. Given ontology tree structures For any two vertices It holds that
We name
A simple example of integrating two hypothetical ontologies.
An example of integrating two hypothetical ontologies that satisfy Criterion (1) is given in Figure
Another example of integrating the two ontologies in Figure
In Section
We make the following major contributions in this work. We propose a novel ontology integration problem that optimizes the cohesion function. We identify optimal structures in this problem and develop both optimal and efficient approximation solutions for it. We extend the basic problem to handle the integration of a large number of ontologies, and we develop both greedy and fast approximation algorithms for the extended problem. We study the proposed algorithms on both real and synthetic datasets and confirm their effectiveness in integrating large volumes of ontology data.
Automatic ontology generation and integration are desirable in many applications and have been studied in the past decade. Although available methods for automatic ontology generation produce ontologies from a given type of data, such as gene networks [
These methods have a few major weaknesses including
In this section, we focus on the basic problem of integrating two ontologies as formulated in Section
Given Criteria (1) and (2), a brute-force approach picks the best solution among all solutions that start by integrating at least one of the roots of the two ontology trees and then iteratively integrate their descendants. Considering an extreme case where each ontology tree is a path of
A heuristic solution can be developed by following an idea similar to the above brute-force approach. However, instead of trying all possibilities, the heuristic solution greedily merges vertices in topological order. When selecting a matching vertex for vertex
It is easy to see that the deeper the vertex chosen for integration is, the more integration opportunities are lost. To alleviate this, we propose a greedy approach that considers the relative depth (
Algorithm
push
continue; {The subtree rooted at identify vertex break; save merge pair mark push
By dividing the integration of two trees into node merging and subtree integrations, we have identified optimal structures in the basic problem, as stated by Lemmas
Sort vertices in
push
make make let push matching results of push matching results of push matching results of
Let
We can divide the integration of tree The roots of The roots of
For case
For case
Combining cases
Let
To prove this lemma, we first prove that for any tree
For case (1), the lowest common ancestor of the roots of
Without loss of generality, we can see that, for any tree
An illustration of three cases in the proof of Lemma
Given Lemma
Define
With Lemma
At the end of Algorithm
It holds that
We will prove this theorem by mathematical induction.
Let
When
When
Given the definition of
Although Algorithm
Algorithm
Assume an ontology size is
Since the maximal weighted matching offers a good overall trade-off between time complexity and approximation ratio, we used the maximal weighted matching in our empirical study for Algorithms
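How the maximal weighted matching is computed is not shown in this section. One common construction, assumed here purely for illustration, is the greedy heaviest-edge-first procedure on a bipartite closeness matrix: it yields a maximal matching whose weight is at least half that of a maximum weighted matching. The function name `greedy_maximal_matching` and the matrix layout are hypothetical.

```python
def greedy_maximal_matching(weights):
    """Greedy maximal weighted matching on a bipartite weight matrix.

    weights[i][j] is the closeness between left vertex i and right vertex j.
    Assumption: non-positive entries mean "no edge".
    Returns a list of (i, j, weight) triples in the order they were picked.
    """
    edges = [
        (w, i, j)
        for i, row in enumerate(weights)
        for j, w in enumerate(row)
        if w > 0
    ]
    # Heaviest edge first; ties broken by indices so the result is deterministic.
    edges.sort(key=lambda e: (-e[0], e[1], e[2]))

    used_left, used_right = set(), set()
    matching = []
    for w, i, j in edges:
        if i not in used_left and j not in used_right:
            matching.append((i, j, w))
            used_left.add(i)
            used_right.add(j)
    return matching


if __name__ == "__main__":
    # Greedy picks 0.9 first and is then forced to take 0.1 (total 1.0),
    # whereas the maximum matching would take 0.8 + 0.7 (total 1.5).
    print(greedy_maximal_matching([[0.9, 0.8], [0.7, 0.1]]))
```

The example shows the trade-off: the greedy result is not optimal, but sorting the edges once is considerably cheaper than solving the assignment problem exactly.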
Compared to the time complexity of integrating ontologies by the dynamic programming approach as described in Algorithms
In the previous section we proposed methods for integrating two ontologies. In some biomedical applications [
Given For any two vertices It holds that
Again, we name the function
The formulation of multiple ontology integration is similar to the basic version, and it is not difficult to show that optimal structures described in Lemmas
From the above discussion we can see that direct extension of Algorithms
Although the basic multiple integration approach can integrate multiple ontologies, it does so blindly, ignoring cohesion information between ontologies that might lead to a better integration result. To improve on the basic multiple integration, we propose a greedy approach that uses the cohesion information between ontologies to guide the integration. The basic steps of the greedy approach are outlined in Algorithm
build the
identify the active tree pair to the highest score in the I integrate mark update relationship matrices; update I
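The control flow of the greedy approach (build the InterOntology matrix, pick the active pair with the highest score, integrate it, mark one tree inactive, update the scores) can be sketched as follows. This is a sketch under stated assumptions: `score` and `integrate` are hypothetical caller-supplied placeholders for the pairwise cohesion computation and the two-ontology integration, and re-scoring the merged ontology from scratch is an assumption, not the update rule of the original algorithm.

```python
def greedy_multi_integration(ontologies, score, integrate):
    """Repeatedly integrate the active pair with the highest score.

    ontologies: non-empty list of ontology objects (any type).
    score(a, b) -> float       hypothetical pairwise cohesion score.
    integrate(a, b) -> merged  hypothetical two-ontology integration.
    Returns (final integrated ontology, list of merged index pairs).
    """
    if not ontologies:
        raise ValueError("at least one ontology is required")
    trees = list(ontologies)
    n = len(trees)
    active = set(range(n))

    # InterOntology matrix, stored sparsely for index pairs i < j.
    inter = {
        (i, j): score(trees[i], trees[j])
        for i in range(n)
        for j in range(i + 1, n)
    }

    order = []
    while len(active) > 1:
        ids = sorted(active)
        # Active pair with the highest score.
        i, j = max(
            ((a, b) for a in ids for b in ids if a < b),
            key=lambda pair: inter[pair],
        )
        trees[i] = integrate(trees[i], trees[j])
        active.discard(j)  # tree j is now part of tree i
        order.append((i, j))
        # Assumption: refresh the scores of the merged tree by re-scoring.
        for k in active:
            if k != i:
                inter[(min(i, k), max(i, k))] = score(trees[i], trees[k])
    return trees[next(iter(active))], order


if __name__ == "__main__":
    # Toy stand-ins: ontologies are term sets, score is the overlap size,
    # and integration is set union.
    toy = [frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({4, 9})]
    print(greedy_multi_integration(toy, lambda a, b: len(a & b), lambda a, b: a | b))
```

Passing the scoring and integration steps in as functions keeps the loop independent of how cohesion is defined, so a raw or an adjusted score can be plugged in without changing the control flow.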
An example of the InterOntology matrix’s change at the first iteration in Algorithm
The key idea in Algorithm
When we used the overall cohesion score between two ontologies to update the InterOntology matrix, we observed an interesting phenomenon: in most cases the integration becomes a process that continuously expands a single integrated ontology. Consequently, the greedy approach is likely to yield a result similar to that of the basic approach.
This phenomenon can be explained by the definition of the maximum cohesion function, which takes into account all pairwise closeness between merged terms. Thus, the more ontologies an integrated ontology contains, the more likely it is to have high overall cohesion scores with other ontologies. This biases the selection of the next pair toward the largest integrated ontology. To fix this issue, we use the adjusted overall cohesion scores in updating the InterOntology matrix as follows.
Given an ontology
Although the basic multiple integration and the greedy multiple integration approaches discussed above are able to integrate multiple ontologies, neither provides any guarantee on its results in comparison with the optimal solutions. By studying the maximum cohesion scores between ontologies under a graph setting, we identified an approximation structure and developed an approximation algorithm for integrating multiple ontologies. We call it the fast approximation algorithm because it not only has a lower bound on the quality of its results but also runs faster than the greedy multiple integration algorithm proposed above.
The fast approximation algorithm for integrating multiple ontologies is sketched in Algorithm
push
adding integrate
The tree weight of the integrated tree
We will use Lemmas
Given
It is easy to see that the weight of a maximum spanning tree is no less than
Combining Claims 1, 2, and 3, we complete the proof of this theorem.
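The argument above refers to the weight of a maximum spanning tree. As a reference point only, the following sketch computes such a tree with Kruskal's algorithm and a union-find structure. It assumes that vertices stand for ontologies and that each edge weight is a pairwise cohesion score; the function name and the edge-list format are hypothetical, and the sketch does not reproduce the contraction and weight-update steps of the fast approximation algorithm.

```python
def maximum_spanning_tree(n, edges):
    """Kruskal's algorithm, taking the heaviest edges first.

    n: number of vertices, labeled 0..n-1.
    edges: list of (u, v, weight) triples of an undirected graph.
    Returns (total weight, list of chosen edges).
    """
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    chosen = []
    total = 0.0
    for u, v, w in sorted(edges, key=lambda e: -e[2]):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two components: keep it
            parent[root_u] = root_v
            chosen.append((u, v, w))
            total += w
    return total, chosen


if __name__ == "__main__":
    graph = [(0, 1, 5.0), (0, 2, 1.0), (1, 2, 4.0), (2, 3, 3.0), (1, 3, 2.0)]
    print(maximum_spanning_tree(4, graph))
```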
An illustration of vertex/edge contraction and weight updates in an iteration of Algorithm
It holds that
According to the problem definition,
It holds that
According to the problem definition, integrating
For the fast approximation algorithm (Algorithm
Following the above analysis, we conclude that the time complexity of the greedy multiple integration algorithm is the same as that of the fast approximation algorithm. However, it requires updating the InterOntology matrix, which takes an additional
Finally, it is easy to see that the basic multiple integration approach takes
Integrating multiple ontologies may face two potential problems in real applications. First, how can we efficiently generate a closeness matrix for every pair of ontologies to be integrated? Our current method
We study the performance of the proposed ontology integration methods through experiments on both real and synthetic datasets. We implemented five approaches in C++: H A B G F
In the following, we report our study on the performances of
The knowledge of drug-gene relationships is desirable in many pharmacology applications [
The overall cohesion scores of H
Cohesion scores of integrating real datasets.
Depth | GO term number | NDFRT term number | Cohesion score (H) | Cohesion score (H) | Cohesion score (H) | Cohesion score (A) |
---|---|---|---|---|---|---|
3 | 66 | 6004 | 0.0505331 | 0.229958 | 0.21999 | 1.24696 |
4 | 710 | 6972 | 0.290392 | 0.284363 | 1.46585 | 9.3835 |
5 | 5355 | 14582 | 0.285923 | 1.37714 | 0.528289 | 33.9056 |
6 | 16231 | 32841 | 0.307941 | 0.341588 | 0.673406 | 74.0293 |
Recall in Section
From Table
Although the running time of A
To understand what terms are merged in integrating real ontologies, we use the integration of GO and NDFRT at depth 6 as an example. Tables
Top 5 matched terms by the A
Rank | GO terms | NDFRT terms | Closeness score |
---|---|---|---|
1 | C1135918 smooth muscle contractile fiber | C0282606 muscle neoplasms | 1.63205 |
2 | C0010813 cytokinesis | C0086376 GTP-binding proteins | 1.15967 |
3 | C0027747 axon terminus | C0030584 parovarian cyst | 1.13352 |
4 | C1155065 T cell activation | C0007082 carcinoembryonic antigen | 1.00879 |
5 | C0007595 cell growth | C0294028 BRCA2 protein | 0.945284 |
Top 5 matched terms by the H
Rank | GO terms | NDFRT terms | Closeness score |
---|---|---|---|
1 | C0031845 biological_process | C0042890 VITAMINS | 0.244862 |
2 | C1166607 cellular_component | C1657248 apoptosome | 0.142857 |
3 | C0027540 tissue death | C0065932 MENADIOL | 0.0434219 |
4 | C0025519 metabolic process | C0042849 VITAMIN B | 0.0370248 |
5 | C0030012 oxidation-reduction process | C0027996 NICOTINIC ACID | 0.0327109 |
From Tables
A snapshot of ontology integration by A
In fact, there are multiple studies to justify the structural associations seen in Figure
In addition, we have noticed a number of meaningful integrations between GO terms and neurological terms in the NDFRT. For example, the synapse is a brain-related structure, and the term “symmetric synapse” is associated with “trauma,” while the term “asymmetric synapse” is associated with “brain neoplasms.” Similarly, it is reasonable to see that “neuronal RNA granule” is integrated with “granulomatous disease,” a granule-associated disease. As another example, “zyxin” is associated with “cell adhesion involved in heart morphogenesis,” which provides a link to the formation of the heart.
The above observations suggest a novel way of using our ontology integration method to perform association studies between biomedical concepts.
In the following experiments we will study the performances of B
We use two sets of synthetic datasets in this study. In the first set of datasets, we fix the number of ontologies to be 10 and vary the size of each ontology from 100 to 1000. In the second set of datasets, we fix the size of each ontology to be 100 and vary the number of ontologies from 10 to 100. All the ontologies are randomly generated by constructing a minimal spanning tree from a random matrix. The relationship matrix between every pair of ontologies is also randomly generated with entry values ranging from 0 to 1. For each experiment, we generate 10 random datasets and the results reported in the following are the average results over the 10 random datasets.
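The data generation described above can be sketched as follows. This is a minimal illustration under assumptions the text does not state: the random matrix is symmetric with uniform entries, the spanning tree is computed with Prim's algorithm and rooted at vertex 0, and the function names `random_tree` and `random_closeness` are hypothetical.

```python
import random


def random_tree(n, rng):
    """Random tree on n vertices: the minimum spanning tree of a random matrix.

    Returns a parent array rooted at vertex 0 (parent[0] == -1).
    """
    # Random symmetric cost matrix with entries drawn uniformly from [0, 1).
    cost = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            cost[i][j] = cost[j][i] = rng.random()

    parent = [-1] * n
    best = [float("inf")] * n  # cheapest known edge into the growing tree
    in_tree = [False] * n
    best[0] = 0.0
    for _ in range(n):
        # Prim's algorithm: add the cheapest vertex that is not yet in the tree.
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        in_tree[u] = True
        for v in range(n):
            if not in_tree[v] and cost[u][v] < best[v]:
                best[v] = cost[u][v]
                parent[v] = u
    return parent


def random_closeness(rows, cols, rng):
    """Random relationship matrix between two ontologies, entries in [0, 1)."""
    return [[rng.random() for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so that runs are repeatable
    trees = [random_tree(100, rng) for _ in range(10)]
    closeness = random_closeness(100, 100, rng)
    print(len(trees), len(closeness), len(closeness[0]))
```

Seeding the generator explicitly makes it straightforward to regenerate the same 10 random datasets when averaging results over repeated runs.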
The overall cohesion scores of the three approaches over different ontology sizes and over different numbers of ontologies are reported in Figures
Change in the overall cohesion score as the size of each ontology increases. The number of ontologies is fixed at 10.
Change in the overall cohesion score as the number of ontologies increases. The size of each ontology is fixed at 100.
The integration time of the three approaches over different ontology sizes and over different numbers of ontologies is reported in Figures
Change in integration time as the size of each ontology increases. The number of ontologies is fixed at 10.
Change in integration time as the number of ontologies increases. The size of each ontology is fixed at 100.
A snapshot of ontology integration between GO and NDFRT by A
These results suggest that F
In this work, we started with the basic problem of integrating a pair of ontology tree structures with a given closeness matrix, and we then extended it to the problem of integrating a large number of ontologies. We proved optimal structures in the basic problem and developed both optimal and efficient approximation solutions. Although the multiple ontology integration problem has similar optimal structures, it is not feasible to extend the optimal and efficient approximation solutions for the basic problem to handle multiple ontology integration efficiently. To tackle the challenge of integrating a large number of ontologies, we developed both an effective greedy approach and a fast approximation approach. The empirical study not only confirms our analysis of the efficiency of the proposed methods but also demonstrates that they can be used effectively for biomedical association studies.
The authors declare that there is no conflict of interests regarding the publication of this paper.