Semantic Signature: Comparative Interpretation of Gene Expression on a Semantic Space

Background. Interpretation of microarray data remains challenging because biological meaning should be extracted from enormous numeric matrices and be presented explicitly. Moreover, huge public repositories of microarray dataset are ready to be exploited for comparative analysis. This study aimed to provide a platform where essential implication of a microarray experiment could be visually expressed and various microarray datasets could be intuitively compared. Results. On the semantic space, gene sets from Molecular Signature Database (MSigDB) were plotted as landmarks and their relative distances were calculated by Lin's semantic similarity measure. By formal concept analysis, a microarray dataset was transformed into a concept lattice with gene clusters as objects and Gene Ontology terms as attributes. Concepts of a lattice were located on the semantic space reflecting semantic distance from landmarks and edges between concepts were drawn; consequently, a specific geographic pattern could be observed from a microarray dataset. We termed a distinctive geography shared by microarray datasets of the same category as “semantic signature.” Conclusions. “Semantic space,” a map of biological entities, could serve as a universal platform for comparative microarray analysis. When microarray data were displayed on the semantic space as concept lattices, “semantic signature,” characteristic geography for a microarray experiment, could be discovered.


Background
Microarray experiments provide high-throughput gene expression profiles to address biological questions for a specific condition. It has been challenging so far to extract biological implication from huge matrices of numeric data and to represent it in a concise and intuitive manner. Also, as microarray datasets accumulate in public repositories, it has become another important issue in microarray analysis how to compare multiple datasets and integrate them [1].
As for the first issue (analysis of an individual microarray experiment), semantic approach was suggested as one of main strategies. For example, clusters of genes with similar expression profiles could be assigned with Gene Ontology (GO) terms or related pathways [2]. However, most of clustering analyses with ontological annotation merely provided long list of biological terms for clusters; they failed to represent functional relationship between clusters and the whole picture of microarray experiments could hardly be grasped. Concept lattice analysis, or formal concept analysis, was proposed as a way to summarize biological information from clusters without annotation redundancy [3]. Concept lattice analysis is a mathematical technique that recognizes hierarchical structure from a relation matrix of objects (clusters) and attributes (annotations) and represents it as a graph (lattice). In this way, clusters are depicted as nodes and set-inclusion relationships of their annotations are drawn as edges in a concept lattice, which can be viewed as an executive summary of the microarray data. The second issue (comparative microarray analysis) is more elusive. External datasets of various conditions can be compared to interpret one microarray experiment of interest [4][5][6][7][8][9]. Or multiple experiments can be gathered and analyzed to deduce causality or association between genes [10][11][12][13][14][15][16][17][18][19][20]. Such comparative analyses can reduce noises that individual experiments might contain and can help with arranging vast datasets to reveal novel relationship between phenotypes or gene expressions. However, comparisons are inevitably dependent on reference data being compared and platforms on which comparisons are made. In particular, scope of reference datasets can be restricted if the platform of comparative analysis is not compatible with diverse microarray experiments.
The current study was motivated to provide a universal, not being influenced by formats of data and platform where microarray data could be visually presented and compared. Furthermore, we hoped it could convey biological implication of experiments. For that purpose, we constructed "semantic space," a map of biological entities. Just as a map encompasses the whole territory of interest, the semantic space employed gene sets of Molecular Signature Database (MSigDB) as landmarks. MSigDB was a collection of gene sets from various sources (positional, curated, motifs, computational, and GO gene sets) and could be regarded as a representation of the biological world at this time point [6]. The coordinates of those landmarks were determined by semantic distances between them based on a predefined semantic similarity measure [21].
On the semantic space, microarray data were mapped as a concept lattice. A concept lattices was produced by formal concept analysis, with clusters of similar expression profiles as objects and GO annotations as attributes. Each cluster was located on the semantic space considering its semantic distance from nearest landmarks and edges of the lattice between clusters were also drawn.
A concept lattice described on the semantic space makes a distinctive topography on the semantic space, termed as a "semantic signature." Semantic signature would give information of how gene clusters of a microarray dataset were related to biological landmarks at a glance. Furthermore, we compared various microarray datasets by simply overlapping their semantic signatures on the semantic space. And we tested whether data of similar experiment paradigm resulted in similar semantic signatures and whether those of different paradigm did different semantic signatures. Figure 1 illustrates the brief process for construction of semantic space and plotting of semantic signature.

Construction of Semantic
Space. Gene sets of MSigDB (http://www.broad.mit.edu/gsea/msigdb) imported to generated landmarks of semantic space. MSigDB collected gene sets from various sources and organized them into 5 categories: positional, curated, motifs, computational, and GO gene sets [6]. In this study, we adopted all the gene sets from mouse species, registered in GO biological process: 192 gene sets in total.
All pair-wise distances of the gene sets were computed and summarized in a matrix. Afterwards, multidimensional scaling was applied to generate 2D coordinates of the gene sets as landmarks in semantic space (cmdscale package, R, https://www.r-project.org/). Finally, the landmarks were plotted in a plane using scalable vector graphics and finally the semantic space was constructed.

Concept Lattices from Microarray Datasets.
From one microarray dataset, one concept lattice was generated as follows. First, -means clustering analysis produced 100 clusters for a dataset. Then, clusters were annotated with significant GO terms based on hypergeometric probability with Bonferroni correction. Third, a relation matrix of clusters and annotations was built. Fourth, a formal concept analysis transformed the relation matrix into a concept lattice with clusters as objects (extent) and GO terms as attributes (intent). For graphic representation, this study adopted Ganter's algorithm [36]. Accordingly, a concept was a set of clusters sharing GO terms. Edges between concepts  in the lattice indicated set-inclusion relationship. Therefore, the graph structure of a concept lattice could convey the whole information of a relation matrix of clusters and GO annotations. (http://www.snubi.org/software/biolattice/.)

Discovery of Semantic
Signature. Once semantic space was constructed and a concept lattice of microarray experiments was generated, the concept lattice could be mapped on the semantic space. Because extents of a concept lattice were clusters, or sets of genes, semantic distance from the concept to any landmark on the semantic space could be calculated just as distances between landmark gene sets were calculated (see Figure 2). To appropriately place the concept on the semantic space, 3 nearest landmarks were found for each concept. The coordinate of the concept was then determined within the triangle of the 3 nearest landmarks according to relative distances to the 3 vertices. In this way, all the concepts of a concept lattice could be located on the semantic space.
Edges between concepts were drawn as well. Finally, the whole concept lattice of microarray experiments was depicted on the semantic space and unique geographic feature could be observed ( Figure 2). In order to compare multiple microarray datasets, concept lattices of them were mapped on the semantic space simultaneously. A common geographic pattern of homogeneous microarray datasets on the semantic space, termed as "semantic signature," was derived by investigating overlapping or closely neighbouring concepts and edges.
To visualize the semantic signature, overlapped edges were emphasized by increasing colour intensity according to the overlapping frequency.

Microarray Datasets.
Microarray data of hepatotoxic agent experiments was obtained from a toxicogenomics study by Toxicogenomics Research Center (TGRC). The study applied 12 toxic agents to mice orally or intraperitoneally and observed gene expression profiles from liver specimen according to time course and dosage. Twelve toxic agents were D-galactosamine, ethanol, tetracycline, valproic acid, methotrexate, ANIT, methylenedianiline, phenytoin, thiabendazole, 6-mercaptopurine, phenylbutazone, and diclofenac.

Results
3.1. Semantic Space. "Semantic space" is shown in Figure 2. 192 gene sets from MSigDB are marked on it as its landmarks. Figure 2 shows the whole picture of the semantic space. Each vertex indicates one gene set. In this study, the semantic space was constructed using scalable vector graphics (SVG) so that the map could be magnified without being blurred. Magnified views of specific regions are illustrated in Figure 2, (a)∼(d).
Given that locations of those gene sets were determined according to semantic distances between them, gene sets sharing similar GO annotations congregated in close vicinity.

Semantic Signature: An Example of Hepatotoxic Agent
Experiment. As an exemplary case, we assessed the semantic signature of a microarray dataset from hepatotoxic agent experiment (see Section 2). The experiment observed acute hepatotoxic effect of 12 toxicants in mouse after oral or intraperitoneal injection and produced 12 microarray data. Expression profile from each agent was converted into a concept lattice and then mapped on the semantic space. Obtained figures of 12 concept lattices were not identical but their concepts and edges were overlapping or closely neighbouring on the semantic space. Those common edges were visually emphasized by grading the colour intensity in proportion to overlapping frequency ( Figure 2). The common concepts and edges were clustered along the right border of the semantic space making figure of a saw tooth, which was then determined as "semantic signature" of the hepatotoxicity experiment.

Semantic Signatures: Comparison of Heterogeneous Experiments.
Other various datasets from different experimental conditions or from different tissues were represented on the semantic space. Twenty heterogeneous microarray datasets were obtained from GEO (Gene Expression Omnibus, http://www.ncbi.nlm.nih.gov/geo/) and they were categorized into 3 conditions (toxin-related, cancer-related, and development-related experiments) and 3 tissues (germinal, hematopoietic, and neural tissues) based on descriptions provided by GEO. To discover semantic signatures per condition or tissue, common concepts and edges were depicted in the same way described above (Figure 3)   However, such visual judgement of similarity remained somewhat subjective and arbitrary without quantitative validation. To circumvent the problem, we tested whether concept lattices of the same experiment category were closer than those of different category, based on semantic similarity among concept lattices. The calculation of the semantic distances between two lattices followed the same strategy that was used for construction of the semantic space ( Figure 4). Concept lattices of the same experiment conditions were closely located (Figure 4(a)). And the lattices of developmentrelated experiments were closer to the lattices of cancerrelated experiments than to those of toxic agent-related ones as was indicated by topographic similarity between their semantic signatures. Also, concept lattices of the same tissue were located in closer vicinity (Figure 4(b)).

Biological Interpretation of Semantic Signatures I.
For biological interpretation of the semantic signature from the hepatotoxic agent experiments, 8 most frequently neighboured landmark gene sets in the signature were listed ( Table 1). One of the landmarks was a gene set that was upregulated by transcription factor Hxc-8, which was known to interact with hematopoietic activities in the liver tissue, suggesting that reactive hematopoiesis is induced by the hepatotoxic agents (LEI HOXC8 UP gene set) [22]. Another   [23]. Other landmarks included a gene set which was associated with inflammatory change induced by TNF (TNFALPHA ADIP UP) [24] and gene sets related to apoptotic of inflammatory signalling by insulin receptor (ROS MOUSE AORTA UP and IGFR IR UP) [25]. While the semantic signatures of hepatotoxic agent experiments shared common landmarks with each other as described above (thick red line in Figure 3), each concept lattice from different hepatotoxic agent also exhibited different patterns (thin red line in Figure 3). For example, concept lattice from ethanol treatment showed more nodes and edges than that from phenytoin, which could be interpreted as ethanol resulting in perturbation in more gene clusters than phenytoin. The finding was consistent with general biological knowledge that ethanol is more toxic than phenytoin to the liver.

Biological Interpretation of Semantic Signatures II.
When geography of the semantic signatures of various microarray datasets was compared, the semantic signature from cancer-related experiments was more similar to that from development-related experiments than to that from toxic agent-related ones. The finding was consistent with an established biological knowledge that one of main oncogenic mechanisms was related to uninhibited activation of developmental pathways. Specifically, the semantic signature of development-related dataset was closely neighboured by landmarks that included differentially expressed genes during cell differentiation (

Representation of Semantic Space on 2D
Space. This study was not the first that attempted to graphically represent biological entities. Several studies have provided "map" of diverse biological objects, such as genes or proteins, based on ontological annotations, structural similarity, or sequence similarities [26][27][28]. However, those studies employed their map only to illustrate specific study results. To the authors' knowledge, semantic space was the first that developed a comprehensive map of biological entities as a reference frame where heterogeneous experiments could be represented and compared.
In this study, we constructed semantic space on 2D space for convenience purpose. But positional relationship of components, especially if they were numerous, should be distorted when represented on lower dimension. A previous work of "yeast functional map" evaluated deforming stress from multidimensional scaling as the dimensionality was changed [26]. It reported 5D space as optimal and that increasing the dimensionality from 2D to 3D did not yield satisfactory decrease of stress, suggesting that 2D space could be a reasonable choice.

Measures for Semantic
Distance. The current study adopted Lin's semantic similarity measure for the calculation of term-to-term semantic distance and best-match-average of it for the calculation of set-to-set distances [29]. Other various measures for semantic distance could be considered, such as Resnik's [30], Jiang and Conrath's [31], or Wang et al. 's measure [32] instead of Lin's measure. Similarly, bestmatch-average method could be replaced by simple average, maximum, or maximum weighted by information content for set-to-set calculation. Jaccard distance could also be employed, which did not require calculation of term-to-term distances [33].
To determine the optimal measure for semantic distance, we evaluated distribution of semantic distances between gene sets by different combination of measures. As a result, bestmatch-average of Lin's measure showed the largest variation of the distance values. Thus, it was chosen for the current study because widely distributed distances reduced distortion stress during dimension reduction. In addition, we also assessed how close were the 3 nearest landmarks that determined a coordinate of each concept in the semantic space, by calculating area of the triangle composed of the 3 landmarks. It was assumed that if 3 nearest landmarks were widely dispersed, placing the concept within the triangle would give less information for biological interpretation. The combination of Lin's measure and best-match-average method resulted in tolerable size of triangles (data not shown).

Limitations and Future Study.
A few limitations should be noted in this study. As already mentioned, 2D representation of numerous landmarks could not avoid inaccuracy in their location. Another limitation is the fact that it could not be claimed that 192 gene sets from MSigDB represented all necessary spots of biological world and the semantic space spanned sufficiently wide territory. Lastly, determination of semantic signature included arbitrary component. Although visual representation of numeric data could help intuitive interpretation, similarity versus difference of geographic pattern should include subjective judgement inevitably.
Future research should consider expression of semantic space in higher dimension and could expand semantic space by merging other sources of biological data. In addition, we should think over to develop a quantitative and objective way by which similarity/difference of semantic signatures could be assessed.

Conclusions
Semantic space was constructed as a map of biological entities based on their relative semantic distances. When concept lattices were projected on the semantic space, "semantic signature," which was defined as specific geographic patterns observed on it, allowed intuitive interpretation of microarray experiments. Comparison of semantic signatures of various microarray datasets revealed distinctive features according to experiment conditions or tissues. In conclusion, "semantic space" could serve as a universal platform for comparative microarray analysis and "semantic signature" could be discovered.