The wide coverage and biological relevance of the Gene Ontology (GO), confirmed through its successful use in protein function prediction, have driven its growing popularity. To exploit the biological knowledge that GO offers in describing genes or groups of genes, an efficient, scalable similarity measure for GO terms and GO-annotated proteins is needed. While several GO similarity measures exist, none adequately addresses all issues surrounding the design and usage of the ontology. We introduce a new metric for measuring the distance between two GO terms using the intrinsic topology of the GO-DAG, thus enabling the measurement of functional similarities between proteins based on their GO annotations. We assess the performance of this metric using ROC analysis on human protein-protein interaction datasets and correlation coefficient analysis on a selected set of protein pairs from the CESSM online tool. The metric performs well compared with existing annotation-based GO measures. We also used it to assess functional similarity between orthologues and show that it is effective at determining whether orthologues are annotated with similar functions and at identifying cases where annotation is inconsistent between orthologues.
Worldwide DNA sequencing efforts have led to a rapid increase in sequence data in the public domain. Unfortunately, functional annotation has not kept pace, and many newly sequenced genes and proteins lack functional annotations. From 20% to 50% of genes within a genome [
By capturing knowledge about a domain in a shareable and computationally accessible form, ontologies can provide defined and computable semantics about the domain knowledge they describe [
GO provides three key biological aspects of genes and their products in a living cell, namely, complete description of the tasks that are carried out by individual proteins, their broad biological goals, and the subcellular components, or locations where the activities are taking place. GO consists of three distinct ontologies, molecular function (MF), biological process (BP), and cellular component (CC), each engineered as a directed acyclic graph (DAG), allowing a term (node) to have more than one parent. Traditionally, there were two types of relationships between a parent and a child. The “is_a” relation means that a child is a subclass or an instance of the parent, and the “part_of” relation indicates the child is a component of a parent. Thus, each edge in a GO-DAG represents either an “is_a” or a “part_of” association. However, another relationship has emerged, namely, “regulates”, which includes “positively_regulates” and “negatively_regulates”, and provides for relationships between regulatory terms and their regulated parents [
The GO has been widely used and deployed in several protein function prediction analyses in genomics and proteomics. This growth in popularity is mainly due to the fundamental organization principles and functional aspects of its conception displayed by its wide coverage and biological relevance. Specific tools, such as the AmiGO browser [
Considering its wide use, the issues related to its design and usage have been qualified as critical points [
Several GO term similarity measures have been proposed for characterizing similar terms, each having its own strengths and weaknesses. These similarity measures are partitioned into edge- and node-based approaches according to Pesquita et al. [
Here we introduce a new semantic similarity measure of GO terms based only on the GO-DAG topology to determine functional closeness of genes and their products based on the semantic similarity of GO terms used to annotate them. This measure incorporates position characteristic parameters of GO terms to provide an unequivocal difference between more general terms at the higher level, or closer to the root, and more specific terms at the lower level, or further from root node. This provides a clearer topological relationship between terms in the hierarchical structure. This new measure is a hybrid node- and edge-based approach, overcoming not only the issue related to the GO-DAG depth, as stated previously, but also the issues related to the dependence on the annotation statistics of node-based approaches and those related to edge-based approaches in which nodes and edges at the same level are evenly distributed.
In this section we survey existing annotation- and topology-based approaches and set up a novel GO semantic similarity metric to measure GO term closeness in the hierarchy of the GO directed acyclic graph (GO-DAG). This novel GO term semantic similarity measure is derived to ensure effective exploitation of the large amounts of biological knowledge that GO offers. This, in turn, provides a measurement of functional similarity of proteins on the basis of their annotations from heterogeneous data using semantic similarities of their GO terms.
We are interested in the IC-based approaches, and unlike the graph-based or hybrid approach introduced by Wang et al. [
To overcome these limitations, Wang introduced a topology-based semantic similarity measure in which the semantic value of a term
On the edge-based similarity approaches, Zhang et al. [
However, a general limitation common to all these semantic similarity measures is that none of them fully addresses the issue related to the depth of the GO-DAG stated previously; that is, the depth sometimes reflects vagaries in different levels of knowledge. An example is where the structure just grows deeper along one path without spreading sideways. In the context of the GO-DAG, such a term is sometimes declared obsolete and automatically replaced by its parent. To account for this, we introduce a topological identity, or synonym term, measure based on term topological information: a parent term having only one child, where that child has only that parent, is assumed to be topologically identical to the child, and the two are assigned the same semantic value. This provides an absolute difference between more general terms closer to the root and more specific terms further from the root node, depending on the topology of the GO-DAG, that is, whether a branch splits into more than one possible path of specificity. Furthermore, this is consistent with human language, in which the semantic similarity between a parent term and its child depends on the number of children that the parent term possesses and on the number of parents that the child term has. Intuitively, a parent with more children is less specific and therefore less informative about any one child, which leads to a lower similarity score between this parent and each of its children.
To illustrate this, let us consider the hierarchical structure in Figure
Fictitious hierarchical structure illustrating the computation of term semantic values. Terms are nodes with “r” as a root.
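The topological identity rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `topologically_identical_pairs` and the toy edge list are hypothetical, and the toy graph is not the structure shown in the figure.

```python
from collections import defaultdict


def topologically_identical_pairs(edges):
    """Return (parent, child) pairs in which the parent has exactly one
    child and that child has exactly one parent."""
    children = defaultdict(set)
    parents = defaultdict(set)
    for parent, child in edges:
        children[parent].add(child)
        parents[child].add(parent)
    return sorted(
        (p, c)
        for p, kids in children.items()
        if len(kids) == 1
        for c in kids
        if len(parents[c]) == 1
    )


# Hypothetical toy DAG rooted at "r" (an assumption for illustration only).
toy_edges = [("r", "a"), ("r", "b"), ("a", "c"), ("b", "d"), ("b", "e")]
identical = topologically_identical_pairs(toy_edges)
print(identical)
```

In this toy graph only the pair (a, c) qualifies: the branch through "a" grows deeper without spreading sideways, whereas "r" and "b" each split into two children.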
Translating the biological content of a given GO term into a numeric value, called the semantic value or topological information, on the basis of its location in the GO-DAG, requires knowledge of the topological position characteristics of its immediate parents. This leads to a recursive formula for measuring topological information of a given GO term, in which the child is expected to be more specific than its parents. The more children a term has, the more specific its children are compared to that term, and the greater the biological difference. In addition, the more parents a term has, the greater the biological difference between this term and each of its parent terms. The three separate ontologies, namely, molecular function (MF), biological process (BP), and cellular component (CC) with GO IDs GO:0003674, GO:0008150, and GO:0005575 respectively, are roots for the complete ontology, located at level 0, the reference level, and are assumed to be biologically meaningless. Unless specified explicitly, in the rest of this work the level of a term is considered to be the length of the longest path from the root down to that term in order to avoid a given term and its child having the same level.
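The longest-path definition of a term's level can be sketched as follows. This is an illustrative sketch under stated assumptions: the mapping `parents_of` (term to list of direct parents) and the toy graph are hypothetical, and the graph is assumed to be acyclic with every term reachable from the root.

```python
from functools import lru_cache


def longest_path_levels(parents_of, root):
    """Level of each term = length of the longest path from the root."""

    @lru_cache(maxsize=None)
    def level(term):
        if term == root:
            return 0
        # A term sits one step below its deepest parent.
        return 1 + max(level(parent) for parent in parents_of[term])

    return {term: level(term) for term in parents_of}


# Hypothetical toy DAG: "d" is reachable via r -> b -> d and r -> a -> c -> d.
toy_parents = {"r": [], "a": ["r"], "b": ["r"], "c": ["a"], "d": ["b", "c"]}
levels = longest_path_levels(toy_parents, "r")
print(levels)
```

With the shortest path, "d" and its parent "c" would both sit at level 2; the longest path places "d" at level 3, so a term never shares a level with its child.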
The topological information
A topological position is thus a function
Note that, in general, the information we possess about something is a measure of how well we understand it and how well ordered it is.
To illustrate the way this approach works, consider the hierarchical structure shown in Figure The topological position characteristic of the root 0 is As 1 and 2 have only parent 0, which has only these two children with 3 has only one direct parent 1 with 4 has two direct parents 1 and 2. 1 has two children with 5 has only one direct parent 2, which has three children and
Hierarchical structure illustrating how our approach works. Nodes are represented by integers from 0 to 11 with 0 as a root. The numbers beside each node represent its topological position characteristic and information content.
Unlike edge-based approaches where nodes and edges are uniformly distributed, and edges at the same level of the ontology correspond to the same semantic distance between terms [
Let There exists one path
Therefore, two GO terms are equal if and only if they are either the same or topologically identical terms. Suppose that there exists a path
The topological position
The semantic similarity measure
To illustrate the GO-universal approach, we use (
Names and characteristics of GO terms in Figure
GO Id | Level | Topological position characteristic | Information content | | |
---|---|---|---|---|---|
GO:0042770 | 6 | 0.0456910e-27 | 6.525565e+01 | 10.11006 | 0.71747 |
GO:0042772 | 7 | 0.1142274e-28 | 6.664195e+01 | 12.30729 | 0.87340 |
GO:0030330 | 7 | 0.1142274e-28 | 6.664195e+01 | 11.20867 | 0.79544 |
GO:0000077 | 7 | 0.0171747e-34 | 8.235221e+01 | 10.92099 | 0.77502 |
GO:0008630 | 10 | 0.0335723e-86 | 2.014164e+02 | 12.30729 | 0.87340 |
GO:0006978 | 8 | 0.0434930e-57 | 1.343825e+02 | 12.30729 | 0.87340 |
GO:0006977 | 9 | 0.0419985e-79 | 1.850743e+02 | 12.30729 | 0.87340 |
GO:0042771 | 11 | 0.1278292e-116 | 2.691569e+02 | 12.30729 | 0.87340 |
GO:0031571 | 8 | 0.1103023e-50 | 1.173338e+02 | 12.30729 | 0.87340 |
GO:0031572 | 8 | 0.0735349e-50 | 1.177393e+02 | 12.30729 | 0.87340 |
GO:0031573 | 8 | 0.4293676e-36 | 8.373851e+01 | 12.30729 | 0.87340 |
GO:0031574 | 8 | 0.2206046e-50 | 1.166406e+02 | 12.30729 | 0.87340 |
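The third and fourth columns of the table above appear to be related by a negative natural logarithm. The sketch below checks this on the tabulated values; the relationship is inferred from the numbers themselves, and the names `rows` and `information_content` are illustrative.

```python
import math

# (GO Id, third column, fourth column), copied from the table above.
rows = [
    ("GO:0042770", 0.0456910e-27, 6.525565e+01),
    ("GO:0042772", 0.1142274e-28, 6.664195e+01),
    ("GO:0030330", 0.1142274e-28, 6.664195e+01),
    ("GO:0000077", 0.0171747e-34, 8.235221e+01),
    ("GO:0008630", 0.0335723e-86, 2.014164e+02),
    ("GO:0006978", 0.0434930e-57, 1.343825e+02),
    ("GO:0006977", 0.0419985e-79, 1.850743e+02),
    ("GO:0042771", 0.1278292e-116, 2.691569e+02),
    ("GO:0031571", 0.1103023e-50, 1.173338e+02),
    ("GO:0031572", 0.0735349e-50, 1.177393e+02),
    ("GO:0031573", 0.4293676e-36, 8.373851e+01),
    ("GO:0031574", 0.2206046e-50, 1.166406e+02),
]


def information_content(position_characteristic):
    """Negative natural logarithm of a value in (0, 1]."""
    return -math.log(position_characteristic)


# Largest disagreement between the recomputed and the tabulated value.
max_gap = max(abs(information_content(mu) - ic) for _, mu, ic in rows)
print(f"largest absolute difference: {max_gap:.2e}")
```

The recomputed values agree with the tabulated ones to within rounding, which also shows why very small position characteristics at deep levels translate into large information content.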
Subgraph of the GO BP. Each box represents a GO term with GO ID,
We calculate the semantic similarity between every two consecutive GO terms in Figure
Semantic similarity values between child-parent pairwise terms in Figure
Parent GO Id | Child GO Id | |||||
---|---|---|---|---|---|---|
GO:0042770 | GO:0042772 | 0.97920 | 0.940 | 10.11006 | 0.71747 | 0.90199 |
GO:0042770 | GO:0030330 | 0.97920 | 0.940 | 10.11006 | 0.71747 | 0.94847 |
GO:0042770 | GO:0008630 | 0.32398 | 0.704 | 10.11006 | 0.71747 | 0.90199 |
GO:0042770 | GO:0000077 | 0.79240 | 0.802 | 10.11006 | 0.71747 | 0.96144 |
GO:0042772 | GO:0006978 | 0.49591 | 0.882 | 12.30729 | 0.87340 | 1.00000 |
GO:0030330 | GO:0006978 | 0.49591 | 0.889 | 11.20867 | 0.79544 | 0.95328 |
GO:0030330 | GO:0006977 | 0.36008 | 0.615 | 11.20867 | 0.79544 | 0.95328 |
GO:0030330 | GO:0042771 | 0.24760 | 0.696 | 11.20867 | 0.79544 | 0.95328 |
GO:0008630 | GO:0042771 | 0.74832 | 0.931 | 12.30729 | 0.87340 | 1.00000 |
GO:0000077 | GO:0031571 | 0.70186 | 0.830 | 10.92099 | 0.77502 | 0.94032 |
GO:0000077 | GO:0031572 | 0.69945 | 0.850 | 10.92099 | 0.77502 | 0.94032 |
GO:0000077 | GO:0031573 | 0.98344 | 0.948 | 10.92099 | 0.77502 | 0.94032 |
GO:0000077 | GO:0031574 | 0.70603 | 0.870 | 10.92099 | 0.77502 | 0.94032 |
GO:0031571 | GO:0006977 | 0.63398 | 0.774 | 12.30729 | 0.87340 | 1.00000 |
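For these parent-child pairs, the first score column is numerically consistent with the ratio of the parent's information content to the child's, taking the information content values from the table of GO term characteristics. The sketch below is an arithmetic check only; it is an observation about the tabulated numbers and is not presented as the general similarity formula. The names `ic`, `pairs` and `parent_child_ratio` are illustrative.

```python
# Information content per term, from the table of GO term characteristics.
ic = {
    "GO:0042770": 65.25565, "GO:0042772": 66.64195, "GO:0030330": 66.64195,
    "GO:0000077": 82.35221, "GO:0008630": 201.4164, "GO:0006978": 134.3825,
    "GO:0006977": 185.0743, "GO:0042771": 269.1569, "GO:0031571": 117.3338,
    "GO:0031572": 117.7393, "GO:0031573": 83.73851, "GO:0031574": 116.6406,
}

# (parent, child, first score column), copied from the table above.
pairs = [
    ("GO:0042770", "GO:0042772", 0.97920), ("GO:0042770", "GO:0030330", 0.97920),
    ("GO:0042770", "GO:0008630", 0.32398), ("GO:0042770", "GO:0000077", 0.79240),
    ("GO:0042772", "GO:0006978", 0.49591), ("GO:0030330", "GO:0006978", 0.49591),
    ("GO:0030330", "GO:0006977", 0.36008), ("GO:0030330", "GO:0042771", 0.24760),
    ("GO:0008630", "GO:0042771", 0.74832), ("GO:0000077", "GO:0031571", 0.70186),
    ("GO:0000077", "GO:0031572", 0.69945), ("GO:0000077", "GO:0031573", 0.98344),
    ("GO:0000077", "GO:0031574", 0.70603), ("GO:0031571", "GO:0006977", 0.63398),
]


def parent_child_ratio(parent, child):
    """Parent information content over the larger of the two values."""
    return ic[parent] / max(ic[parent], ic[child])


worst = max(abs(parent_child_ratio(p, c) - score) for p, c, score in pairs)
print(f"largest absolute difference: {worst:.2e}")
```

Read this way, a child that adds little information beyond its parent scores close to 1, while a child much deeper in information content than its parent scores low.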
Table
A given protein may perform several functions, thus requiring several GO terms to describe these functions. For characterized or annotated pairwise proteins with known GO terms, functional closeness or GO similarities based on their annotations and consequently the distances between these proteins can be evaluated using the Czekanowski-Dice approach [
Czekanowski-Dice’s measure is poorly suited to GO term sets, since GO terms may be similar at some level without being identical. The measure cannot capture this, because only GO terms that match exactly between the two proteins’ term sets contribute to it. One can attempt to avoid this difficulty by incorporating the true path rule in the computation of the intersection and union of GO term sets for proteins. However, in most cases where these proteins are annotated by successive GO terms in the GO-DAG, the number of elements in the union of these sets equals that of their intersection plus one, so the functional closeness of these proteins is forced towards 1, independently of the biological contents of the GO terms in the GO-DAG.
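Both failure modes can be seen on a small example. The sketch below uses the standard Dice coefficient on sets and a hypothetical linear chain of terms t0 (root) to t10; the helper names and the chain are assumptions made for illustration.

```python
def dice_similarity(terms_a, terms_b):
    """Dice coefficient on two term sets: 2|A & B| / (|A| + |B|)."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def with_ancestors(term, parents_of):
    """True path rule: a term implies all of its ancestors."""
    seen, stack = {term}, [term]
    while stack:
        for parent in parents_of.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical chain t0 <- t1 <- ... <- t10, with t0 as the root.
chain = {f"t{i}": [f"t{i - 1}"] for i in range(1, 11)}

# Two proteins annotated with successive terms on the chain.
exact = dice_similarity({"t10"}, {"t9"})
propagated = dice_similarity(with_ancestors("t10", chain),
                             with_ancestors("t9", chain))
print(exact, round(propagated, 3))
```

Exact matching gives 0 even though the two terms are adjacent. After propagation the union has 11 terms and the intersection 10, so the score is 20/21 and approaches 1 as the chain grows deeper, whatever the terms mean biologically.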
To overcome this problem, we set up a functional similarity between proteins which emphasizes semantic similarity between terms in their sets of GO terms considered to be uniformly distributed. This functional similarity is given by
Thus, owing to the fact that
This shows that the functional closeness formula emphasizes the importance of the shared GO terms by assigning more weight to similarities than differences. Thus, for two proteins that do not share any similar GO terms, the functional closeness value is 0, while for two proteins sharing exactly the same set of GO terms, the functional closeness value is 1. The functional similarity between proteins in (
Note that the approach used here to combine GO term topological information for calculating protein functional similarity scores was used in the context of annotation-based approaches and is referred to as the best match average (BMA) approach. This approach has been suggested to be better than the average (Avg) [
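The three ways of combining term-level scores into a protein-level score can be compared on a toy example. This is a sketch under stated assumptions: the pairwise scores are hypothetical, and the BMA variant shown pools all best matches and divides by the total number of terms (another common variant averages the two directional means), which may differ from the exact formula used in this work.

```python
def score_matrix(terms_p, terms_q, sim):
    """Pairwise term similarities, rows for terms_p and columns for terms_q."""
    return [[sim(s, t) for t in terms_q] for s in terms_p]


def avg_score(matrix):
    """Average over all term pairs."""
    return sum(map(sum, matrix)) / (len(matrix) * len(matrix[0]))


def max_score(matrix):
    """Single best term pair."""
    return max(map(max, matrix))


def bma_score(matrix):
    """Best match average: each term is paired with its best partner."""
    row_best = [max(row) for row in matrix]
    col_best = [max(col) for col in zip(*matrix)]
    return (sum(row_best) + sum(col_best)) / (len(row_best) + len(col_best))


# Hypothetical term similarities; identical terms score 1.
toy_sim = {("A", "A"): 1.0, ("A", "C"): 0.2, ("B", "A"): 0.1, ("B", "C"): 0.6}
matrix = score_matrix(["A", "B"], ["A", "C"], lambda s, t: toy_sim[(s, t)])

print(avg_score(matrix), max_score(matrix), bma_score(matrix))
```

On this example Avg is pulled down to 0.475 by the weak cross pairs, Max reports 1.0 from a single shared term, and BMA gives 0.8, which reflects both the shared term and the partially related pair.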
We have developed a semantic value measurement approach for GO terms using the intrinsic topology of the GO-DAG and taking into account issues related to the depth of the structure. We evaluate our method against the Wang et al. and Zhang et al. topology-based methods for a specific subgraph of the GO-DAG and then use UniProt data to compare our similarity scores to those of annotation-based approaches. Note that the Zhang et al. approach has recently been shown to perform equally to the Resnik measure and to perform better than the Wang et al. measure [
We have seen Section
Another negative aspect of Wang’s approach is that it essentially relies on the semantic factors of “is_a” and “part_of” relations, and it is not clear for which values of these semantic factors the semantic similarity measure yields the optimal value of biological content of terms. Moreover, these semantic factors make the similarity value between a given child and its direct parent independent of the number of children that the parent term has (shown in (
The Zhang approach, which depends only on the children of a given term, often fails to effectively differentiate a child from its parents, yielding an equal
We first evaluated the performance of the new metric by assessing its ability to capture functional coherence in a human protein-protein interaction network in terms of how interacting proteins are functionally related to each other. Expert-curated and experimentally determined human protein-protein interactions (PPIs) were retrieved from the IntAct database [
For our performance evaluation, we only used proteins annotated with BP terms in the network produced. This is because two proteins that interact physically are more likely to be involved in similar biological processes [
The classification power of the new metric was tested by receiver operating characteristic (ROC) curve analysis [
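The area under a ROC curve equals the probability that a randomly chosen positive pair receives a higher score than a randomly chosen negative pair. The sketch below computes it that way; the scores and labels are hypothetical, and this is not the evaluation pipeline used for the results reported here.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical functional similarity scores; label 1 marks an interacting pair.
toy_scores = [0.90, 0.80, 0.40, 0.35, 0.30]
toy_labels = [1, 1, 0, 1, 0]
auc = roc_auc(toy_scores, toy_labels)
print(round(auc, 3))
```

The pairwise loop is quadratic in the number of examples, which is acceptable for a small illustration; a rank-based computation is preferable for large datasets.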
Area under ROC curves (AUCs) and precision for the human PPI dataset. For each group, the top score is in bold.
Approaches | AUC, excluding IEA | AUC, including IEA | Precision, excluding IEA | Precision, including IEA | Accuracy, excluding IEA | Accuracy, including IEA |
---|---|---|---|---|---|---|
GO-universal | ||||||
Resnik | 0.933 | 0.931 | 0.724 | 0.701 | 0.713 | 0.739 |
Lin | 0.763 | 0.691 | 0.610 | 0.568 | 0.481 | 0.549 |
SimUIC | 0.916 | |||||
SimGIC | 0.922 | 0.974 | 0.974 | |||
SimUI | 0.975 | 0.978 | 0.866 | 0.845 | 0.926 | 0.937 |
ROC evaluations of functional similarity approaches based on the human PPI dataset derived from different PPI databases.
These results indicate that all the approaches perform well. In the context of term-based approaches, the new approach performs as well as the SimGIC approach, which is the best annotation-based measure in this case, in terms of AUC, but it performs slightly better than the SimGIC approach in terms of precision excluding IEA and accuracy. When considering protein functional similarity approaches derived from GO term semantic similarity scores (first three rows of Table
Looking at the two main groups of protein functional similarity approaches, term-based approaches perform better than those using GO term semantic similarity scores. This is in part because protein functional similarity approaches using GO term semantic similarity scores are based on statistical measures of closeness (Avg, Max), which are known to be sensitive to outliers, that is, scores that lie at abnormal distances from the majority of scores. These measures may therefore bias protein functional similarity scores. Furthermore, we investigated whether performance can be improved by leaving out GO annotations with IEA evidence codes. Interestingly, no significant improvement is achieved when these annotations are left out, suggesting that IEA annotations are in fact of high quality [
We assess the effectiveness of the new metric against other topology-based approaches, namely the Wang and Zhang approaches, as well as against the Resnik-related functional similarity measures and SimGIC. We used a dataset of proteins with known relationships downloaded from the Collaborative Evaluation of Semantic Similarity Measures (CESSM) online tool [
To evaluate the new metric, we ran the CESSM online tool and results are shown in Table
Comparison of the performance of our approach with the Wang et al., Zhang et al., and annotation-based approaches using Pearson’s correlation with Enzyme Commission (EC), Pfam and sequence similarity, and resolution. Results are obtained from the CESSM online tool. For each ontology, the top two scores among the 12 approaches are in bold.
Ontology | Approaches | Similarity measure correlation | Resolution | |||
EC | PFAM | Seq Sim | ||||
BP | GO-Universal | (BMA) | ||||
Wang et al. | 0.43266 | 0.63356 | ||||
Zhang et al. | 0.21944 | 0.26495 | 0.20270 | 0.30148 | ||
Resnik | Avg | 0.30218 | 0.32324 | 0.40685 | 0.33673 | |
Max | 0.30756 | 0.26268 | 0.30273 | 0.64522 | ||
BMA | 0.45878 | 0.73973 | 0.90041 | |||
Term-based | SimUIC | 0.38458 | 0.43693 | 0.74410 | 0.84503 | |
SimGIC | 0.39811 | 0.45470 | 0.83730 | |||
MF | GO-Universal | (BMA) | 0.60285 | 0.55163 | 0.52905 | |
Wang et al. | 0.49101 | 0.37101 | 0.33109 | |||
Zhang et al. | 0.49753 | 0.41147 | 0.32235 | 0.39865 | ||
Resnik | Avg | 0.39635 | 0.44038 | 0.50143 | 0.41490 | |
Max | 0.45393 | 0.18152 | 0.12458 | 0.38056 | ||
BMA | 0.60271 | 0.57183 | ||||
Term-based | SimUIC | 0.65826 | 0.60512 | |||
SimGIC | 0.62196 | 0.95590 |
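The correlation columns in the table above are Pearson coefficients between a functional similarity score and a reference similarity (EC, Pfam or sequence). For reference, the coefficient can be computed as sketched below; the input vectors are hypothetical and the CESSM tool performs this computation itself.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores: GO-based similarity vs. a reference similarity.
go_scores = [0.1, 0.4, 0.5, 0.9]
reference = [0.2, 0.8, 1.0, 1.8]  # exactly twice the GO-based scores
r = pearson(go_scores, reference)
print(round(r, 6))
```

Because the coefficient is invariant to linear rescaling, a measure is rewarded for tracking the reference similarity, not for matching its absolute values.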
Orthologous proteins in different species are thought to maintain similar functions. Therefore, we used protein sequence data together with protein GO annotations to determine the extent to which sequence similarities between protein orthologues are translated into similarities between their GO annotations through the GO-universal metric using protein orthologues between human (
In order to produce sequence similarity data, an all-against-all BLASTP [
Proportion in percentage of Human-Mouse orthologue pairs sharing high functional similarity.
Approach | BP (using all GO evidence codes) | MF (using all GO evidence codes) | BP (leaving out IEA and ISS) | MF (leaving out IEA and ISS) |
---|---|---|---|---|
GO-Universal | 76 | 82 | 12 | 49 |
Resnik | 76 | 80 | 13 | 38 |
The high proportion of functionally similar protein orthologues observed in the full dataset was expected, since many of the GO annotations probably arose from homology-based annotation transfer [
Some human-mouse protein orthologue pairs without GO-based functional similarity.
Ontology | Protein ID | Organism | GO ID | GO name | Code | Source |
---|---|---|---|---|---|---|
BP | A1Z1Q3 | Homo sapiens | GO:0042278 | Purine nucleoside metabolic process | IDA | UniProtKB |
BP | Q3UYG8 | Mus musculus | GO:0007420 | Brain development | IEP | UniProtKB |
BP | Q96EQ8 | Homo sapiens | GO:0032480 | Negative regulation of type I interferon production | TAS | Reactome |
BP | | | GO:0045087 | Innate immune response | TAS | Reactome |
BP | Q9D9R0 | Mus musculus | GO:0016567 | Protein ubiquitination | EXP | GOC |
BP | O00451 | Homo sapiens | GO:0007169 | Transmembrane receptor protein tyrosine kinase signaling pathway | TAS | PINC |
BP | | | GO:0035860 | Glial cell-derived neurotrophic factor receptor signaling pathway | TAS | GOC |
BP | O08842 | Mus musculus | GO:0007399 | Nervous system development | IMP | MGI |
BP | Q9BS16 | Homo sapiens | GO:0000087 | M phase of mitotic cell cycle | TAS | Reactome |
BP | | | GO:0000236 | Mitotic prometaphase | TAS | Reactome |
BP | | | GO:0000278 | Mitotic cell cycle | TAS | Reactome |
BP | | | GO:0006334 | Nucleosome assembly | TAS | Reactome |
BP | | | GO:0034080 | Cenh3-containing nucleosome assembly at centromere | TAS | Reactome |
BP | Q9ESN5 | Mus musculus | GO:0045944 | Positive regulation of transcription from RNA polymerase II promoter | IDA | MGI |
BP | O15347 | Homo sapiens | GO:0006310 | DNA recombination | ISS | UniProtKB |
BP | | | GO:0007275 | Multicellular organismal development | TAS | PINC |
BP | O54879 | Mus musculus | GO:0045578 | Negative regulation of B cell differentiation | IDA | MGI |
BP | | | GO:0045638 | Negative regulation of myeloid cell differentiation | IDA | MGI |
BP | Q9NP31 | Homo sapiens | GO:0001525 | Angiogenesis | IEA | UniProtKB |
BP | | | GO:0007165 | Signal transduction | TAS | PINC |
BP | | | GO:0007275 | Multicellular organismal development | IEA | UniProtKB |
BP | | | GO:0030154 | Cell differentiation | IEA | UniProtKB |
BP | Q9QXK9 | Mus musculus | GO:0008283 | Cell proliferation | IMP | occurs_in (CL:0000084) |
BP | Q9C035 | Homo sapiens | GO:0009615 | Response to virus | IEA | UniProtKB |
BP | | | GO:0044419 | Interspecies interaction between organisms | IEA | UniProtKB |
BP | | | GO:0070206 | Protein trimerization | IDA | UniProtKB:Q9C035-1 |
BP | P15533 | Mus musculus | GO:0006351 | Transcription, DNA-dependent | IEA | UniProtKB |
BP | | | GO:0006355 | Regulation of transcription, DNA-dependent | IEA | UniProtKB |
MF | Q86XR7 | Homo sapiens | GO:0004871 | Signal transducer activity | IMP | UniProtKB |
MF | Q8BJQ4 | Mus musculus | GO:0005515 | Protein binding | IPI | BHF-UCL |
MF | Q99218 | Homo sapiens | GO:0030345 | Structural constituent of tooth enamel | IDA | BHF-UCL |
MF | P63277 | Mus musculus | GO:0005515 | Protein binding | IPI | MGI, BHF-UCL |
MF | | | GO:0008083 | Growth factor activity | IMP | BHF-UCL |
MF | | | GO:0042802 | Identical protein binding | IPI | BHF-UCL |
MF | | | GO:0043498 | Cell surface binding | IMP | BHF-UCL |
MF | | | GO:0046848 | Hydroxyapatite binding | IDA | BHF-UCL |
MF | P45379 | Homo sapiens | GO:0003779 | Actin binding | IDA | UniProtKB |
MF | | | GO:0005523 | Tropomyosin binding | IDA | UniProtKB |
MF | | | GO:0030172 | Troponin C binding | IPI | UniProtKB |
MF | | | GO:003113 | Troponin I binding | IPI | UniProtKB |
MF | | | GO:0016887 | Atpase activity | IDA | UniProtKB:P45379-1-6-7-8 |
MF | P50752 | Mus musculus | GO:0005200 | Structural constituent of cytoskeleton | IDA | occurs_in (CL:0000193) |
MF | Q9H0E3 | Homo sapiens | GO:0003713 | Transcription coactivator activity | IDA | UniProtKB |
MF | | | GO:0004402 | Histone acetyltransferase activity | IDA | UniProtKB |
MF | Q8BIH0 | Mus musculus | GO:0005515 | Protein binding | IPI | UniProtKB |
MF | Q5T9L3 | Homo sapiens | GO:0004871 | Signal transducer activity | ISS | UniProtKB |
MF | Q6DID7 | Mus musculus | GO:0005515 | Protein binding | IPI | UniProtKB |
MF | | | GO:0017147 | Wnt-protein binding | IDA | UniProtKB |
MF | A8CG34 | Homo sapiens | GO:0005515 | Protein binding | IPI | UniProtKB |
MF | Q8K3Z9 | Mus musculus | GO:0017056 | Structural constituent of nuclear pore | IEA | ENSEMBL |
MF | O15446 | Homo sapiens | GO:0003899 | DNA-directed RNA polymerase activity | IEA | UniProtKB |
MF | Q76KJ5 | Mus musculus | GO:0005515 | Protein binding | IPI | MGI |
In this work, we have set up a new approach to measure the closeness of terms in the Gene Ontology (GO), thus translating the difference between the biological contents of terms into numeric values using topological information shared by these terms in the GO-DAG. Like other measures, this enables us to measure functional similarities of proteins on the basis of their GO annotations derived from heterogeneous data sources using semantic similarities of their GO terms. We compare our method to two similar measures and show its advantages. The similarity measure we define behaves consistently in that going down the DAG (away from the root) increases specificity, thus providing an effective semantic value for GO terms that reflects functional relationships between GO-annotated proteins.
The relevance of this measure is evident when considering the GO hierarchy, as it makes explicit use of the two main relationships between different terms in the DAG, which makes it possible to provide a more precise view of the similarities between terms. This measure yields a simple and reliable semantic similarity between GO terms and functional similarity measure for sets of GO terms or proteins. We have validated this new metric using ROC analysis on human PPI datasets and a selected protein dataset from UniProt with their GO annotations obtained from GOA-UniProt and analysis by the Collaborative Evaluation of Semantic Similarity Measures (CESSM) online tool. Results show that this new GO-semantic value measure that we have introduced constitutes an effective solution to the GO metric problem for the next generation of functional similarity metrics.
As a biological use case, we have applied the GO-universal metric to determine functional similarity between orthologues based on their GO annotations. In most cases functional conservation was shown, but we did identify some orthologues annotated with different functions. This suggests that the new metric can be used to track protein annotation errors or missing annotations. We are currently applying it to assess the closeness of InterPro entries using their mappings to GO. This measure will also be used to design a retrieval tool for genes and gene products based on their GO annotations, providing a new tool for gene clustering and knowledge discovery on the basis of GO annotations. Given a source protein or a set of GO terms, this engine will be able to retrieve functionally related proteins from a specific proteome based on their functional closeness, or identify genes and gene products matched by these functions or very similar functions.
The authors declare that they have no conflict of interest.
N. J. Mulder generated and supervised the project, and finalized the manuscript. G. K. Mazandu analyzed, designed and implemented the model, and wrote the paper. N. J. Mulder and G. K. Mazandu analyzed the data and read and approved the final paper, and N. J. Mulder approved the production of this paper.
Any work dependent on open-source software owes a debt to those who developed these tools. The authors thank everyone involved with free software, from the core developers to those who contributed to the documentation. Many thanks to the authors of the freely available libraries, in particular, the GO Consortium who made this work possible. This work has been supported by the Claude Leon Foundation Postdoctoral Fellowship, the National Research Foundation (NRF) in South Africa, and the Computational Biology (CBIO) research group at the Institute of Infectious Disease and Molecular Medicine, University of Cape Town.