In recent years, a growing number of researchers have begun to focus on how to establish associations between clinical and genomic data. However, to date, little research has mined clinic-genomic associations by comprehensively analysing the available gene expression data for a single disease. Colorectal cancer is a malignant tumour, and a number of genetic syndromes have been proven to be associated with it. This paper presents our research on mining clinic-genomic associations for colorectal cancer in a biomedical big data environment. The proposed method combines multiple technologies: extracting clinical concepts using the unified medical language system (UMLS), extracting genes through literature mining, and mining clinic-genomic associations through statistical analysis. We applied this method to datasets extracted from both the gene expression omnibus (GEO) and the genetic association database (GAD). A total of 23517 clinic-genomic associations between 139 clinical concepts and 7914 genes were obtained, of which 3474 associations between 31 clinical concepts and 1689 genes were identified as highly reliable. Evaluation and interpretation were performed using UMLS, KEGG, and Gephi, and potential new discoveries were explored. The proposed method is effective in mining valuable knowledge from available biomedical big data and performs well in bridging clinical data with genomic data for colorectal cancer.
Cancer is one of the major diseases that endanger human life. As the American Cancer Society reported, a total of 1,660,290 new cancer cases and 580,350 cancer deaths were projected to occur in the United States in 2013 [
Modern medicine is moving toward the direction of personalized medicine, which refers to the tailoring of medical treatment to the individual characteristics of each patient [
With the development of medical informatics and molecular biology, vast amounts of biomedical data have been accumulated. These data cover multiple levels, including clinical data at the macroscopic level and genomic data at the microscopic level. However, most clinical data have no corresponding genomic data, while most genomic data lack precise clinical annotation. Owing to the lack of effective linkages, the fruits of basic research have not been fully translated into clinical practice, and problems arising in clinical practice have not influenced the direction of basic research as much as expected. The exploited value of available biomedical data is far less than their intrinsic value. Mining associations between clinical data and genomic data from the massive body of available biomedical data can therefore deepen our understanding of the origin and progression of disease, promote bidirectional translation between clinical research and basic research, and ultimately advance personalized medicine.
In recent years, a growing number of researchers have begun to focus on how to establish associations between clinical data and genomic data. In this paper, such an association is termed a clinic-genomic association, meaning that a clinical feature may affect a gene's expression value or that the gene may govern the clinical feature. A notable example is the Human Disease Network established by Goh et al. [
Colorectal cancer is the second leading cause of cancer death in the United States and the fifth leading cause in China [
To this end, this paper takes colorectal cancer as a typical disease to study how to mine clinic-genomic associations for a given disease using publicly available datasets, with the aim of facilitating the diagnosis and treatment of colorectal cancer. In addition, the proposed method can provide a general approach to promoting personalized medicine for other diseases. The proposed method consists of three parts: extracting clinical concepts using the unified medical language system (UMLS) [
On one hand, public gene expression data repositories, such as gene expression omnibus (GEO) [
We accessed the GEO site on February 3, 2013. Colorectal cancer related GEO series (GSE) were preliminarily identified as those that passed the custom filter rule [see Supplementary Table S1 (Supplementary Material available online at
All sample data tables and platform data tables of a GSE are stored in a single SOFT file. Because it is inconvenient and inefficient to read such a large, line-based, plain-text file each time it is needed, we parsed the 628 GSE files into separate sample table files and platform table files, with each sample table file holding data for one GEO sample (GSM) and each platform table file holding data for one GEO platform (GPL). Most of the clinical information is located in the title, source, species, characteristics, and description fields of the GSM annotations. We developed an in-house Perl program to extract this information into a relational database for further analysis. GSM not from human beings and GSM without any of the keywords colon, rectum, rectal, hepatic flexure, or sigmoid in the extracted annotations were eliminated during extraction. Sample data tables index the expression measurements of multiple RNA transcripts by Probe Set ID, while the external gene identifiers, names, and symbols are stored in the platform data tables. So that the same gene measured on different platforms could be unified, Probe Set IDs were mapped to HUGO (human genome organization) symbols. Since platforms are made by different manufacturers, each of which provides its own platform data tables, UniGene symbols appear in different columns of the platform data tables or are absent altogether, which makes automatic mapping from Probe Set IDs to gene symbols quite difficult. We therefore carried out this procedure manually in order to make full use of the platform data tables. As for platforms not providing HUGO symbols, IDconverter [
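The annotation extraction and sample filtering described above can be illustrated with a minimal Python sketch (the authors used an in-house Perl program and a relational database, neither of which is reproduced here). It assumes the standard SOFT conventions in which a sample block starts with `^SAMPLE = <accession>` and annotation lines start with `!Sample_`. The accessions, the helper names `parse_soft_samples` and `keep_sample`, and the toy input are hypothetical.

```python
import re

# Keywords listed in the text for retaining colorectal samples.
KEYWORDS = ("colon", "rectum", "rectal", "hepatic flexure", "sigmoid")
# SOFT attribute names corresponding to the annotation fields named in the text.
FIELDS = ("title", "source_name", "organism", "characteristics", "description")


def parse_soft_samples(lines):
    """Collect per-GSM annotation values from the lines of a GSE SOFT file."""
    samples, current = {}, None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("^SAMPLE"):
            current = line.split("=", 1)[1].strip()
            samples[current] = {field: [] for field in FIELDS}
        elif line.startswith("^"):
            current = None  # a series or platform block begins
        elif current and line.startswith("!Sample_") and "=" in line:
            key, value = line[len("!Sample_"):].split("=", 1)
            key = re.sub(r"_ch\d+$", "", key.strip())  # drop channel suffix
            if key in samples[current]:
                samples[current][key].append(value.strip())
    return samples


def keep_sample(annotation):
    """Keep human samples whose annotations mention a colorectal keyword."""
    if "homo sapiens" not in [o.lower() for o in annotation["organism"]]:
        return False
    text = " ".join(v for values in annotation.values() for v in values).lower()
    return any(keyword in text for keyword in KEYWORDS)


# Hypothetical two-sample SOFT fragment (accessions are made up).
example = """^SERIES = GSE0001
!Series_title = demo
^SAMPLE = GSM0001
!Sample_title = Colon tumour 1
!Sample_organism_ch1 = Homo sapiens
!Sample_characteristics_ch1 = tissue: sigmoid colon
!sample_table_begin
ID_REF\tVALUE
p1\t7.2
!sample_table_end
^SAMPLE = GSM0002
!Sample_title = Liver control
!Sample_organism_ch1 = Mus musculus
""".splitlines()

samples = parse_soft_samples(example)
kept = sorted(gsm for gsm, ann in samples.items() if keep_sample(ann))
print(kept)
```

Plain substring matching is used for simplicity; a production filter would probably match on word boundaries so that, for example, "colony" does not count as "colon".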
GAD was accessed on April 18, 2012. We used the keyword-search function of the GAD query tool to obtain colorectal cancer related records. By setting the search field to “disease” and entering one colorectal cancer related keyword at a time [see Supplementary Table S2], a total of 4784 records were retrieved and downloaded into an Excel file for subsequent processing.
GEO annotations are in free-text format, and a given concept is frequently expressed in different ways by different scientists, making it difficult to organize and compare data generated by different research institutions. Taking “colon cancer” as an example, it can be described as “colon cancer,” “colon carcinoma,” “human carcinoma colon cell,” and so on. An ontology is therefore needed to link these various descriptions together. UMLS is the largest thesaurus in the biomedical domain, collecting biomedical concepts from controlled vocabularies and classifications used in patient records, administrative health data, bibliographic databases, and so on. Each concept is annotated with at least one semantic type from a semantic network that broadly covers the medical domain [
Extracting clinical concepts using UMLS. Three steps were involved. First, free-text annotations were mapped to UMLS concepts using MetaMap. Second, clinical concepts were screened out by semantic type. Finally, a manual review was performed to eliminate mapping errors.
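The second and third steps (screening by semantic type, then removing mappings rejected in manual review) can be sketched in Python as follows. This is only an illustration: the MetaMap call itself is not shown, the record layout is hypothetical, and the set of semantic types treated as clinical is an assumption built from type names mentioned in this paper rather than the list the authors actually used.

```python
# Semantic types treated as clinical. The names appear in this paper, but the
# exact set used by the authors is not given here, so this set is an assumption.
CLINICAL_TYPES = {
    "Neoplastic Process",
    "Disease or Syndrome",
    "Sign or Symptom",
    "Finding",
}


def screen_concepts(mapped, clinical_types, rejected):
    """Keep concepts with a clinical semantic type that survived manual review.

    mapped   -- list of dicts: source text, mapped concept, semantic types
    rejected -- set of (source text, concept) pairs flagged as mapping errors
    """
    kept = set()
    for record in mapped:
        if not clinical_types.intersection(record["semantic_types"]):
            continue  # step 2: not a clinical semantic type
        if (record["source"], record["concept"]) in rejected:
            continue  # step 3: removed during manual review
        kept.add(record["concept"])
    return sorted(kept)


# Hypothetical mapping output; "pn" -> "Pain" is one of the abbreviation
# errors reported in the results of this paper.
mapped = [
    {"source": "inflammatory bowel disease", "concept": "Inflammatory bowel disease",
     "semantic_types": ["Disease or Syndrome"]},
    {"source": "abdominal bloating", "concept": "Abdominal bloating",
     "semantic_types": ["Sign or Symptom"]},
    {"source": "pn", "concept": "Pain", "semantic_types": ["Sign or Symptom"]},
    {"source": "human", "concept": "Homo sapiens", "semantic_types": ["Human"]},
]
rejected = {("pn", "Pain")}

clinical_concepts = screen_concepts(mapped, CLINICAL_TYPES, rejected)
print(clinical_concepts)
```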
First, a Java program was developed to map the “characteristics” and “description” fields to UMLS concepts by calling the API of MetaMap (a program developed at the National Library of Medicine) [
The genetic factors leading to colorectal cancer have been extensively studied, and a large number of research papers have been published on the subject. This large body of published biomedical literature is one of the richest data sources for systematically identifying colorectal cancer related genes without microarray expression experiments. In order to obtain nontrivial knowledge quickly and accurately, we took available literature-mining results as a data source instead of running a literature-mining algorithm ourselves. GAD was employed in this paper, and the colorectal cancer related records in GAD have already been selected and curated in an Excel sheet in Section
We proposed a statistical-analysis-based clinic-genomic association method for colorectal cancer. The procedure is shown in Figure
Statistical-analysis-based association mining flow. Four steps were involved. First, for each concept, GSM data were divided into two groups and further organized into data subsets based on GSE and GPL. Second, for each data subset, differentially expressed genes were screened out according to statistical significance and biological significance. Third, for each concept, the differentially expressed genes from every data subset were integrated. Finally, associations were established between each concept and its differentially expressed genes.
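Steps one, three, and four of this flow can be illustrated with a short Python sketch (the authors implemented the method in MATLAB). It assumes that the two groups are the samples annotated with the concept and those that are not; the record layout, helper names, accessions, and gene labels are all hypothetical. The per-subset screening of step two is represented only by its output.

```python
from collections import defaultdict


def split_by_subset(samples, concept):
    """Step 1: group GSM ids by (GSE, GPL), split by presence of the concept."""
    subsets = defaultdict(lambda: ([], []))
    for sample in samples:
        side = 0 if concept in sample["concepts"] else 1
        subsets[(sample["gse"], sample["gpl"])][side].append(sample["gsm"])
    # A subset is only usable when both groups are non-empty.
    return {key: groups for key, groups in subsets.items() if groups[0] and groups[1]}


def integrate(concept, genes_per_subset):
    """Steps 3-4: merge per-subset gene lists into concept-gene associations.

    Returns {(concept, gene): set of supporting GSE accessions}.
    """
    support = defaultdict(set)
    for (gse, _gpl), genes in genes_per_subset.items():
        for gene in genes:
            support[(concept, gene)].add(gse)
    return dict(support)


# Hypothetical sample annotations.
samples = [
    {"gsm": "GSM1", "gse": "GSE_A", "gpl": "GPL_X", "concepts": {"Recurrence"}},
    {"gsm": "GSM2", "gse": "GSE_A", "gpl": "GPL_X", "concepts": set()},
    {"gsm": "GSM3", "gse": "GSE_B", "gpl": "GPL_X", "concepts": {"Recurrence"}},
]
subsets = split_by_subset(samples, "Recurrence")

# Hypothetical output of the per-subset screening (step 2).
genes_per_subset = {
    ("GSE_A", "GPL_X"): {"GENE_1", "GENE_2"},
    ("GSE_B", "GPL_Y"): {"GENE_1"},
}
associations = integrate("Recurrence", genes_per_subset)
print(subsets)
print(associations)
```

Keeping the set of supporting GSE for each association makes it straightforward to count later how many independent series back a given concept-gene pair.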
Hundreds of data subsets needed to be analysed, and some of them contain more than 400 samples. Such a heavy computational burden poses a great challenge to most computation tools. MATLAB is a software package with powerful computing capabilities. Most importantly, the bioinformatics toolbox of MATLAB includes a series of robust and well-tested functions, providing an integrated software environment for genome and proteome analysis. Based on these considerations, we used MATLAB to implement the proposed association mining method.
To explore genes that are differentially expressed in data group A relative to data group B, we first hypothesized that all genes are equally expressed in the two groups and then tested this hypothesis with the “mattest” function, which the bioinformatics toolbox of MATLAB provides for the classical t-test. A list of
Fold-change is defined as the average expression over all samples in one condition divided by the average expression over all samples in another condition. The average expression should be on the original (linear) scale rather than a logarithmic or exponential scale. Diverse preprocessing methods were used to obtain the preprocessed data: some data had been log-transformed, while others had not, and most of the processed data did not state the scale explicitly. We therefore put forward the following algorithm to detect whether the input data were in log scale. As a general rule, if the values range from around 0 to 16, the data are in log scale; if they range from around 0 to 40000, the data are in the original scale. Quantile values were computed to explore the range of the data distribution. The original algorithm refers to GEO2R [
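A minimal Python sketch of such a scale check and of the fold-change definition is given below. The decision rule follows the quantile test that, to our understanding, appears in the R script generated by GEO2R (99th percentile above 100, or a range above 50 with a positive first quartile, indicates untransformed data); how the authors adapted that rule is not stated here, so the thresholds should be read as an assumption. The sketch also assumes that log-scale data are base 2.

```python
import math
from statistics import mean


def quantile(values, p):
    """Quantile with linear interpolation between order statistics."""
    xs = sorted(values)
    pos = p * (len(xs) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def looks_linear(values):
    """True when the values appear to be on the original (non-log) scale."""
    q0, q25, q99, q100 = (quantile(values, p) for p in (0.0, 0.25, 0.99, 1.0))
    return q99 > 100 or (q100 - q0 > 50 and q25 > 0)


def fold_change(group_a, group_b, is_log2):
    """Ratio of mean expression in A to mean expression in B on the linear scale."""
    if is_log2:
        group_a = [2 ** x for x in group_a]
        group_b = [2 ** x for x in group_b]
    return mean(group_a) / mean(group_b)


raw = [120.0, 850.0, 40000.0, 15.0, 2300.0]   # made-up intensities
logged = [math.log2(x) for x in raw]

print(looks_linear(raw), looks_linear(logged))
print(fold_change([8.0, 8.0], [7.0, 7.0], is_log2=True))
```

Converting log-scale values back to the linear scale before averaging keeps the fold-change consistent with the definition in the text, at the cost of making the mean more sensitive to a few very high intensities.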
Only genes passing both the
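The two-criterion screening (statistical significance from a t-test, biological significance from fold-change) can be sketched in Python as follows. This is not the MATLAB “mattest” implementation: it assumes an equal-variance two-sample t-test, computes the p-value by numerically integrating the t density, and treats both up- and down-regulation as significant. The cut-off values are function parameters because the values used in this paper are not reproduced here; the numbers in the usage example are purely illustrative, as are the gene labels and expression values.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson integration of the density over [0, |t|]."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3)


def pooled_t(a, b):
    """Equal-variance two-sample t statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb)), df


def screen(expr_a, expr_b, p_cutoff, fc_cutoff):
    """expr_a, expr_b: {gene: [log2 values]}. Keep genes passing both cut-offs."""
    hits = {}
    for gene in sorted(expr_a.keys() & expr_b.keys()):
        a, b = expr_a[gene], expr_b[gene]
        t, df = pooled_t(a, b)
        p = two_sided_p(t, df)
        # Fold-change is taken on the linear scale, as the text requires.
        fc = mean(2 ** x for x in a) / mean(2 ** x for x in b)
        if p < p_cutoff and (fc >= fc_cutoff or fc <= 1 / fc_cutoff):
            hits[gene] = {"p": p, "fold_change": fc}
    return hits


# Made-up log2 expression values for two groups of four samples.
expr_a = {"GENE_UP": [10.0, 10.2, 9.9, 10.1], "GENE_FLAT": [7.0, 7.3, 6.8, 7.1]}
expr_b = {"GENE_UP": [8.0, 8.1, 7.9, 8.2], "GENE_FLAT": [7.1, 6.9, 7.2, 7.0]}

# Illustrative cut-offs only; they are not taken from the paper.
hits = screen(expr_a, expr_b, p_cutoff=0.05, fc_cutoff=2.0)
print(sorted(hits))
```

The sketch assumes each group has at least two samples and non-zero pooled variance; real data would need a guard for constant probes.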
A total of 665 colorectal cancer related clinical concepts (see Supplementary Table S6) were obtained using the UMLS-based method. About 115 concepts (14.5%) that resulted from incorrect mapping were ruled out during the manual review process. The most common mapping errors came from abbreviations, such as “pain” from “pn” and “Edema” from “ED.” Semantic type mistakes were also found, such as “Dukes Disease” from “Dukes Stage.”
Clustering clinical concepts by semantic type and counting the number of concepts in each semantic type gives an intuitive view of what receives the most attention in clinical studies of colorectal cancer. The distribution of the concepts acquired in this paper among the 20 semantic types is shown in Supplementary Figure S1. Neoplastic process, biologically active substance, finding, and disease or syndrome cover more than half of all concepts. The number of GSM related to each concept reflects the importance of the concept to some extent. The top 15 concepts related to the largest number of GSM are presented in Table
Top 15 concepts related to the largest number of GSM.
Concept name | GSM count | Rank |
---|---|---|
Medical history | 1004 | 15 |
Family history | 1002 | 16 |
Instability | 907 | 19 |
Microsatellite repeat | 895 | 20 |
Recurrence | 844 | 21 |
Protein p53 | 824 | 22 |
Histology procedure | 643 | 27 |
Death | 570 | 29 |
Microsatellite instability | 547 | 32 |
Primary neoplasm | 514 | 34 |
Tobacco use | 446 | 36 |
Encounter due to tobacco use | 446 | 37 |
Ethnic | 446 | 38 |
Leukaemia | 434 | 40 |
Encounter due to therapy | 402 | 41 |
From GAD, we obtained 904 genes, of which 247 were annotated with “Y,” 159 with “N,” and 823 with “Blank” (the three groups overlap). The association value indicates whether a gene is associated with a disease. However, it is not unique for some genes, because each value in GAD depends on a single paper and different papers may reach different conclusions. For this reason, we did not consider the specific association value in the following analysis and simply assumed that these genes are related to colorectal cancer in some way. From GEO, we obtained 7914 genes, extracted from our mined clinic-genomic associations. A total of 8392 genes [see Supplementary Table S7] were obtained from these two data sources after removing duplicates.
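As a quick consistency check on these counts, the merge can be viewed as a set union. The Python sketch below assumes that the 904 and 7914 figures are already distinct within each source and that duplicates were identified by exact gene symbol; under those assumptions the reported totals imply how many genes the two sources share. The helper name and the toy symbols are hypothetical.

```python
def merge_gene_sets(gad_genes, geo_genes):
    """Return (union, intersection) of two collections of gene symbols."""
    gad, geo = set(gad_genes), set(geo_genes)
    return gad | geo, gad & geo


# Toy example with made-up symbols.
merged, shared = merge_gene_sets(["GENE_1", "GENE_2"], ["GENE_2", "GENE_3"])
print(sorted(merged), sorted(shared))

# Counts reported in the text: |GAD| + |GEO| - |union| = |overlap|.
n_gad, n_geo, n_union = 904, 7914, 8392
implied_overlap = n_gad + n_geo - n_union
print(implied_overlap)
```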
Genes from GAD are extracted from the published literature, whereas genes from GEO are selected by statistical analysis. The former are more reliable but less abundant, while the latter are the opposite. Genes from GEO are extracted from clinic-genomic associations, each of which was deduced from the gene expression data of one or more GSE. Imposing different restrictions on the number of association-related GSE yields different numbers of genes. For instance, we obtained only 1687 genes after requiring more than one related GSE. The overlap rate between genes from GEO and genes from GAD also varies with the restriction. Assuming that genes from GAD are reliable, the overlap rate reflects, to a certain extent, the relevance of genes from GEO to colorectal cancer. A method to calculate the relevance quantitatively using the overlap rate was defined as
Figure
Quantitative evaluation of the degree of association between genes from GEO and colorectal cancer. The horizontal axis shows the number of association-related GSE, while the vertical axis shows the relevance score computed using formula (
To further interpret the obtained results, we can link genomic information with higher order functional information. KEGG is a suitable knowledge base for the systematic analysis of gene functions [
The 51 genes among the acquired results that are involved in the colorectal cancer pathway.
A total of 23517 associations between 7914 genes and 139 concepts were found. All the associations are listed in Supplementary Table S8. Such a large number of associations is difficult to evaluate or interpret directly. Therefore, we used two methods to gain a deeper understanding of them. First, we used visualization as a means of leveraging human perceptual abilities to find useful information in the obtained associations. Shape, colour, distance, and other elements can all be used to support understanding of a network. In this paper, Gephi was used to perform visualization analysis of the clinic-genomic associations from different perspectives, including an overall view, a data-source-feature view, and a semantic-type view. Clinical concepts were input into Gephi as source nodes, genes as destination nodes, and the absolute values of the fold-changes as edge weights. Associations were clustered by the default modularity method, fast unfolding of communities in large networks, and different classes are shown in different colours. Second, the number of association-related GSE was taken as the criterion for recognizing highly reliable associations.
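The Gephi input described above can be prepared as an edge-list table. The Python sketch below writes a CSV with `Source`, `Target`, and `Weight` columns, which Gephi's spreadsheet import recognizes for edge tables; how the authors actually loaded their data is not stated, so this is one possible route. The function name and the example associations are hypothetical, and fold-changes are assumed to be signed values whose magnitude becomes the edge weight.

```python
import csv
import io


def write_edge_list(associations, handle):
    """Write (concept, gene, fold_change) triples as a Gephi edge table."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for concept, gene, fold_change in associations:
        # Concepts are source nodes, genes are target nodes, and the
        # absolute fold-change is the edge weight, as described in the text.
        writer.writerow([concept, gene, abs(fold_change)])


# Made-up associations for illustration.
associations = [
    ("Recurrence", "GENE_A", -2.5),
    ("Osteoporosis", "GENE_B", 3.0),
]
buffer = io.StringIO()
write_edge_list(associations, buffer)
edge_csv = buffer.getvalue()
print(edge_csv)
```

Writing to a file opened with `newline=""` instead of the in-memory buffer gives a file that can be imported directly as an edge table.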
In this view, all associations were imported into Gephi together. Force Atlas was used as the layout algorithm. The analysis result is shown in Supplementary Figure S18, and a simplified version is shown in Figure
Simplified version of the overall view of clinic-genomic associations. The complete version can be found in Supplementary Figure S18. This simplified version was generated by omitting several less prominent concepts and genes so that the important associations are shown more clearly.
Besides, Figure
Due to various features of data source, statistical analysis results of certain data group subsets pass
Data-source-feature view of obtained clinic-genomic associations. The number of related genes for each concept was limited to no more than 10. Dual Circle Layout algorithm was employed in Gephi. Several genes stood out from this view.
Classifying clinic-genomic associations by the semantic type of the clinical concepts helps to give a deeper understanding of the associations within each semantic type. We analysed each of the 20 semantic types separately. Selected results are presented in Supplementary Figures S2~S17. Many meaningful rediscoveries as well as interesting new findings were obtained. Two typical semantic types, “Disease or Syndrome” and “Sign or Symptom,” are discussed in detail in this paper.
The “Disease or Syndrome” semantic type covers 676 associations between 10 clinical concepts and 445 genes. The Gephi analysis result is shown in Supplementary Figure S6, in which the generally acknowledged colorectal cancer related diseases, including inflammatory disease, irritable bowel syndrome, intestinal disease, and inflammatory bowel disease, are very conspicuous. In addition, the “Osteoporosis” concept also comes into view. It is not widely known as a colorectal cancer related disease, but some researchers have recently claimed that osteoporosis is associated with the risk of colorectal adenoma in women [
The “Sign or Symptom” semantic type includes 393 associations between 10 clinical concepts and 393 genes. The Gephi output is shown in Supplementary Figure S17. In addition to abdominal bloating, the most prominent one, other familiar signs or symptoms, such as red stools, diarrhoea, constipation, change in bowel habit, and vomiting and nausea, also appear in Supplementary Figure S2~S17. Furthermore, “Angina Pectoris” appears unexpectedly. It is not a common symptom of colorectal cancer, but the eHealthMe website displays a group of data from colon cancer patients who have angina pectoris [
Different GSE, generally submitted to GEO by different researchers, are essentially independent of one another. Therefore, an association that can be deduced from more than one GSE is unlikely to have been obtained by accident. This point was also illustrated in Section
In general, this paper proposed a method for mining associations between clinical data and genomic data using publicly available datasets, which is an important task in the era of big data. We focused on a typical disease, colorectal cancer, to explore the potential of the vast amounts of existing biomedical data. Colorectal cancer related symptoms, diseases or syndromes, neoplastic processes, and other clinical features are all covered in this research. This is a novel exploration, since few researchers have carried out such thorough work for a single disease in this way. The outcome is encouraging, but there is still much room for improvement. First, clinical concepts were regarded as independent during the statistical analysis in order to reduce complexity; further measures should therefore be taken to guarantee accuracy. Second, as an exploration, we took only one representative database, GEO, as the source of expression data; many more datasets could be included in future studies. Last, to make good use of the association mining results and to share the association mining method with peer researchers, a publicly available platform would be helpful. Such a platform is now under development.
With the aims of facilitating the diagnosis and treatment of colorectal cancer and of providing a general approach to promoting personalized medicine for other diseases, this paper proposed a clinic-genomic association mining method for colorectal cancer, which consists of three parts: extracting clinical concepts using UMLS, extracting genes through literature mining, and mining clinic-genomic associations through statistical analysis. Using the proposed method, 23517 clinic-genomic associations between 139 clinical concepts and 7914 genes were obtained. Moreover, 3474 of these associations, relating 31 clinical concepts to 1689 genes, were identified as highly reliable based on the number of association-related GSE. Many results were validated, and there are also several potential new discoveries, including a colorectal cancer related disease (osteoporosis) and a related symptom (angina pectoris), which supports the correctness and usefulness of the proposed method. These results can be shared with clinical, basic, and translational researchers to suggest new study directions or to answer unsettled questions. As bridges between clinical research and genomic research, these associations should help to accelerate bidirectional translation between the two fields. In addition, the method can be transferred to the analysis of other diseases, such as breast cancer and liver cancer. In future work, we will expand the data sources by incorporating ArrayExpress and SMD to enrich our results. Expanding the knowledge of clinical concepts by combining UMLS-based concepts with electronic medical records is another appropriate direction for our research.
GAD: Genetic association database
GEO: Gene expression omnibus
GPL: GEO platform
GSE: GEO series
GSM: GEO sample
KEGG: Kyoto Encyclopaedia of Genes and Genomes
OMIM: Online Mendelian Inheritance in Man
SMD: Stanford microarray database
SOFT: Simple omnibus format in text
UMLS: Unified medical language system
HUGO: Human genome organization.
The authors declare that they have no conflicts of interest.
This work was supported by the National High Technology Research and Development Programs of China (863 Programs, no. 2012AA02A601 and no. 2012AA020201), the National Science and Technology Major Project of China (no. 2013ZX03005012), and the National Natural Science Foundation of China, no. 31100592.