One of the most important and challenging problems in biomedicine is how to predict cancer-related genes. Retinoblastoma (RB) is the most common primary intraocular malignancy and usually occurs in childhood. Early detection of RB could reduce morbidity and improve the probability of disease-free survival. Therefore, it is of great importance to identify RB genes. In this study, we developed a computational method to predict RB-related genes based on Dagging, with feature ranking by the maximum relevance minimum redundancy (mRMR) method followed by incremental feature selection (IFS). A total of 119 RB genes were compiled from two previous RB-related studies, while 5,500 non-RB genes were randomly selected from Ensembl genes. Ten datasets were constructed based on all these RB and non-RB genes. Each gene was encoded with a 13,126-dimensional vector including 12,887 Gene Ontology enrichment scores and 239 KEGG enrichment scores. Finally, an optimal feature set including 1,061 GO terms and 8 KEGG pathways was obtained. Analysis showed that these features were closely related to RB. It is anticipated that the method can be applied to predict other cancer-related genes as well.
Retinoblastoma (RB) is a rapidly developing cancer in infants that arises in the cells of the retina, the light-detecting tissue of the eye [
As a kind of neuroectodermal tumor, heritable RB is mainly caused by mutation of the Rb gene and dysfunction of tumor suppressor genes [
Systems biology approaches for discovering cancer-related genes have been reported [
Here, we developed a new systems biology method to identify RB genes and their pathways effectively and efficiently. First, we identified 119 RB genes from the overlap of two gene expression studies of retinoblastoma. To identify GO terms and KEGG pathways that differ between RB and non-RB genes, 5,500 non-RB genes were randomly selected from the Ensembl genes. Then all the genes were encoded with 12,887 Gene Ontology enrichment scores and 239 KEGG enrichment scores. mRMR was used to rank these features, and IFS was used to select the optimal subset, with Dagging as the prediction engine. Finally, 1,061 GO terms and 8 KEGG pathways were obtained as the optimal features to discriminate RB from non-RB genes, and these features were shown to be closely related to RB.
The 119 consistently differentially expressed genes between retinoblastoma and normal retina were obtained from the overlap between differentially expressed genes discussed in two gene expression studies of retinoblastoma [
The Gene Ontology enrichment score of a protein is defined as the −log10 of the hypergeometric test P value
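As a minimal sketch of this definition, the score can be computed from the upper tail of the hypergeometric distribution. The argument names (`N` background genes, `M` genes annotated to the GO term, `n` genes in the set being tested, `m` of them annotated) and the example numbers are assumptions made for illustration; the gene set actually tested for each protein is not specified in this passage.

```python
from math import comb, log10

def hypergeom_tail(N, M, n, m):
    """P(X >= m) when n genes are drawn without replacement from N genes,
    M of which are annotated to the GO term."""
    total = comb(N, n)
    tail = sum(comb(M, k) * comb(N - M, n - k) for k in range(m, min(M, n) + 1))
    return tail / total

def enrichment_score(N, M, n, m):
    """-log10 of the hypergeometric tail probability."""
    p = hypergeom_tail(N, M, n, m)
    if p == 0.0:
        raise ValueError("overlap m is impossible for the given N, M, n")
    return -log10(p)

# Hypothetical numbers: 20000 background genes, 150 annotated to the term,
# a 30-gene set of which 5 carry the annotation.
score = enrichment_score(20000, 150, 30, 5)
print(f"enrichment score = {score:.2f}")
```

A larger score corresponds to a smaller P value, that is, to stronger over-representation of the term. The same function would apply to KEGG pathways by counting pathway membership instead of GO annotation.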
We calculated the Cramer’s V coefficient [
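For reference, Cramér's V for two categorical variables can be sketched as below, using the standard definition V = sqrt(chi2 / (n * min(r - 1, c - 1))) on an r x c contingency table. The function name and the example tables are illustrative assumptions; which pairs of variables were compared in this study is not stated in this passage.

```python
from math import sqrt

def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts.
    Assumes at least two rows and two columns."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_tot), len(col_tot)) - 1
    return sqrt(chi2 / (n * k))

# Hypothetical tables: perfect association and complete independence.
print(cramers_v([[10, 0], [0, 10]]))
print(cramers_v([[5, 5], [5, 5]]))
```

The coefficient ranges from 0 (no association) to 1 (complete association), so it can be compared across tables of different sizes.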
We used the minimum redundancy maximal relevance (mRMR) method to rank the importance of the features [
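The ranking idea can be illustrated with a small greedy sketch: at each step, pick the feature whose mutual information with the class, minus its mean mutual information with the features already chosen, is largest. This is the difference form of the criterion on discrete features, written from scratch; it is not the mRMR software used in the study, and the function names and toy data are assumptions.

```python
from collections import Counter
from math import log2

def mutual_information(x, y):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * log2(c * n / (px[a] * py[b])) for (a, b), c in pxy.items())

def mrmr_rank(features, target):
    """Rank features (dict: name -> list of discrete values) greedily by
    relevance to the target minus mean redundancy with selected features."""
    relevance = {name: mutual_information(col, target) for name, col in features.items()}
    selected, remaining = [], list(features)

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(mutual_information(features[name], features[s]) for s in selected)
        return relevance[name] - redundancy / len(selected)

    while remaining:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: "a_copy" duplicates "a", so it is pushed to the end of the ranking.
target = [0, 0, 0, 1, 1, 1, 1, 1]
features = {
    "a":      [0, 0, 0, 0, 1, 1, 1, 1],
    "a_copy": [0, 0, 0, 0, 1, 1, 1, 1],
    "c":      [0, 0, 1, 1, 0, 0, 1, 1],
}
print(mrmr_rank(features, target))  # ['a', 'c', 'a_copy']
```

In the toy example, a ranking by relevance alone would place the duplicate second; the redundancy term is what moves the weakly relevant but non-redundant feature ahead of it.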
Dagging is a metaclassifier that employs majority vote to combine multiple models derived from a single learning algorithm using disjoint samples [
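The scheme described above can be sketched in a few lines: split the training set into disjoint, class-stratified parts, fit one base model per part, and combine predictions by majority vote. The nearest-centroid base learner, the function names, and the toy data are assumptions chosen to keep the example self-contained; the base learner used in the study is not specified in this passage.

```python
import random
from collections import Counter

def train_centroid(X, y):
    """Toy base learner (an assumption): predict the class with the nearest centroid."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}

    def predict(row):
        return min(centroids,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])))
    return predict

def dagging_fit(X, y, n_parts=3, base=train_centroid, seed=0):
    """Train one base model per disjoint stratified part; predict by majority vote."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    parts = [[] for _ in range(n_parts)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, i in enumerate(indices):
            parts[pos % n_parts].append(i)  # each sample lands in exactly one part
    models = [base([X[i] for i in p], [y[i] for i in p]) for p in parts if p]

    def predict(row):
        votes = Counter(model(row) for model in models)
        return votes.most_common(1)[0][0]
    return predict

# Toy data: two well-separated clusters.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5], [0.2, 0.8],
     [10, 10], [10, 11], [11, 10], [11, 11], [10.5, 10.5], [10.2, 10.8]]
y = [0] * 6 + [1] * 6
model = dagging_fit(X, y, n_parts=3)
print(model([0.3, 0.3]), model([10.4, 10.4]))
```

Because the parts are disjoint, each base model sees a different portion of the data, which is what distinguishes this scheme from bagging, where the samples are drawn with replacement and overlap.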
Ten-fold cross validation is often used to evaluate the performance of a classifier [
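A minimal sketch of the splitting step is given below: the samples are shuffled once, divided into ten folds, and each fold in turn serves as the test set while the other nine are used for training. The function name, the seed, and the sample count are illustrative assumptions, and the split is not stratified by class.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # fold sizes differ by at most one
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# Hypothetical dataset of 25 samples.
for train, test in k_fold_indices(25, k=10, seed=1):
    print(len(train), len(test))
```

Every sample is tested exactly once, so the predictions from the ten folds can be pooled to compute a single set of performance measures.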
Based on the features ranked by mRMR, we used incremental feature selection (IFS) [
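The IFS loop itself is simple and can be sketched as follows: evaluate the classifier on the top 1, top 2, ..., top k ranked features and keep the subset with the best score. The `evaluate` callback stands in for a cross-validated performance measure such as MCC; it, the function name, and the toy scores are assumptions for illustration.

```python
def incremental_feature_selection(ranked_features, evaluate, max_features=None):
    """Score nested subsets of a ranked feature list and return the best one.

    evaluate: callable taking a list of feature names and returning a score.
    Returns (best_subset, best_score, curve), where curve is a list of (k, score).
    """
    limit = len(ranked_features)
    if max_features is not None:
        limit = min(limit, max_features)
    curve = []
    for k in range(1, limit + 1):
        curve.append((k, evaluate(ranked_features[:k])))
    best_k, best_score = max(curve, key=lambda item: item[1])
    return ranked_features[:best_k], best_score, curve

# Toy scores (hypothetical): performance peaks at three features.
toy_scores = {1: 0.20, 2: 0.35, 3: 0.50, 4: 0.45, 5: 0.40}
ranked = ["f1", "f2", "f3", "f4", "f5"]
subset, score, curve = incremental_feature_selection(
    ranked, lambda feats: toy_scores[len(feats)])
print(subset, score)  # ['f1', 'f2', 'f3'] 0.5
```

The `curve` value holds the points of an IFS curve, and `max_features` caps the number of subsets that are evaluated.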
After running the mRMR software, we obtained two tables for each of the ten datasets (see Supplementary S2). The MaxRel feature table ranks the features according to their relevance to the class of samples, and the mRMR feature table ranks the features by maximum relevance to the class of samples and minimum redundancy among the features. In the mRMR feature table, a feature with a smaller index is more important for discriminating RB and non-RB genes. This ranked feature list was used in the following IFS procedure to select the optimal feature set.
By adding the ranked features one by one, we built 500 individual predictors based on 500 feature subsets to predict RB genes for each of the ten datasets. We then tested the prediction performance of each of the 500 predictors and obtained the IFS results (see Supplementary S3). The IFS curves plotted from the data in Supplementary S3 are shown in Figure
The predicted results for ten datasets.
Dataset | Optimal feature number | Sn | Sp | Acc | MCC |
---|---|---|---|---|---|
1 | 156 | 0.5727 | 0.9291 | 0.8697 | 0.5174 |
2 | 141 | 0.6273 | 0.9218 | 0.8727 | 0.5452 |
3 | 337 | 0.7364 | 0.8691 | 0.8470 | 0.5347 |
4 | 140 | 0.6000 | 0.9327 | 0.8773 | 0.5471 |
5 | 126 | 0.5636 | 0.9436 | 0.8803 | 0.5434 |
6 | 489 | 0.6273 | 0.9255 | 0.8758 | 0.5527 |
7 | 78 | 0.5545 | 0.9527 | 0.8864 | 0.5588 |
8 | 222 | 0.6364 | 0.9345 | 0.8848 | 0.5795 |
9 | 319 | 0.6545 | 0.9218 | 0.8773 | 0.5663 |
10 | 235 | 0.5545 | 0.9491 | 0.8833 | 0.5495 |
Mean (standard deviation) | | 0.6127 (0.0567) | 0.9280 (0.0234) | 0.8755 (0.0113) | 0.5494 (0.017) |
Sn: sensitivity; Sp: specificity; Acc: accuracy; MCC: Matthews’s correlation coefficient.
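The four measures in the table follow from the confusion matrix by their standard definitions, sketched below. The counts in the example (63 true positives, 47 false negatives, 511 true negatives, 39 false positives) are not reported in the source; they are illustrative values back-calculated so that the rates match the first row of the table.

```python
from math import sqrt

def classification_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy, and Matthews's correlation coefficient."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + fn + tn + fp)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}

# Illustrative counts (an assumption, see the paragraph above).
metrics = classification_metrics(tp=63, fn=47, tn=511, fp=39)
print({name: round(value, 4) for name, value in metrics.items()})
# {'Sn': 0.5727, 'Sp': 0.9291, 'Acc': 0.8697, 'MCC': 0.5174}
```

With far more negative than positive samples, accuracy is dominated by specificity, which is why MCC is the more informative single measure for selecting the optimal feature number.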
IFS curve for the first dataset. The maximal MCC was 0.5174 when 156 features were used.
To compare the enrichment results obtained from the positive samples alone with the selected GO and KEGG terms, we conducted an enrichment analysis for the 119 RB genes. The results showed that 12 GO terms were significantly enriched (Benjamini adjusted
To illustrate the biological meaning of the selected optimal feature subset, we first classified the GO terms in the optimal set into three categories: biological process, cellular component, and molecular function. The GO terms of the features obtained by the mRMR method were then mapped to the children of the three root GO terms. The figures show the frequency of each GO term in the feature subset and the ratio of that frequency to the number of its children terms.
Illustrating the distribution of GO terms of biological process in the optimal feature set. (a) The frequency of GO terms of biological process. (b) The percentage of GO terms of biological process.
For the percentage of BP terms, the top five GO biological processes are GO:0006794: phosphorus utilization (4.99%), GO:0022610: biological adhesion (4.85%), GO:0008283: cell proliferation (4.81%), GO:0071840: cellular component organization or biogenesis (4.26%), and GO:0019740: nitrogen utilization (4.08%). Phosphorus utilization provides cells with sources for phosphorylation and supports regular cellular activities. The percentage distribution of GO biological process terms shows that GO terms related to cell proliferation and biological adhesion are also highlighted, although they contain fewer terms than the others. This indicates that proteins annotated with these two GO terms have a relatively high influence on RB. For example, RB1 is a key regulator of cell proliferation and fate in retinoblastoma, and its phosphorylation can lead to conformational alterations and inactivate the capability of RB1 to bind partner proteins [
Illustrating the distribution of GO terms of cellular component in the optimal feature set. (a) The frequency of GO terms of cellular component. (b) The percentage of GO terms of cellular component.
Extracellular matrix is associated with the cell adhesion mentioned in the previous section. Cells that have lost adhesion, with a disrupted extracellular matrix and no natural protection, tend to become tumor cells under external stress. Here, the percentage distribution suggests that extracellular matrix is highly related to RB. Additionally, the inclusion of membrane-enclosed lumen, organelle, and organelle part indicates that cell organelles (with or without membrane) may also be involved in RB.
Illustrating the distribution of GO terms of molecular function in the optimal feature set. (a) The frequency of GO terms of molecular function. (b) The percentage of GO terms of molecular function.
In Figure
We obtained eight KEGG pathway terms in the optimal set of features (see Supplementary S5): hsa00520 (amino sugar and nucleotide sugar metabolism), hsa00563 (glycosylphosphatidylinositol- (GPI-) anchor biosynthesis), hsa03015 (mRNA surveillance pathway), hsa03440 (homologous recombination), hsa03450 (nonhomologous end joining), hsa04110 (cell cycle), hsa04114 (oocyte meiosis), and hsa04330 (notch signaling pathway). Among them, amino sugar and nucleotide sugar metabolism (hsa00520) emphasizes the role of sugar metabolism in eye cancer. The glycosylphosphatidylinositol- (GPI-) anchor biosynthesis (hsa00563) pathway is related to the anchoring of proteins to the outside of the membrane. The next three all belong to the genetic information processing category: the mRNA surveillance pathway (hsa03015) is involved in translation, and the other two (hsa03440 and hsa03450) deal with replication and repair. Cell cycle (hsa04110) and oocyte meiosis (hsa04114) are related to cell growth and death, and the notch signaling pathway (hsa04330) is involved in signal transduction.
The canonical pathway that links tumor suppressor gene Rb to human cancers details its interaction with the E2F transcription factors and cell-cycle progression [
We proposed a computational method to identify cancer-related genes that takes GO enrichment scores and KEGG enrichment scores as features, and we applied this method to RB. The method revealed an optimal feature set of 1,061 GO terms and 8 KEGG pathways, and these features were shown to be closely related to RB. We believe this method is efficient and effective for predicting novel cancer-related genes and is broadly applicable in cancer research.
Zhen Li and Bi-Qing Li contributed equally to this paper.
This work was supported by grants from the National Basic Research Program of China (2011CB510101 and 2011CB510102), Innovation Program of Shanghai Municipal Education Commission (12ZZ087), and the grant of “The First-class Discipline of Universities in Shanghai” and Medical Introductory Project of Science and Technology Commission of Shanghai Municipality (124119a9500).