Establishing Reliable miRNA-Cancer Association Network Based on Text-Mining Method

Associating microRNAs (miRNAs) with cancers is an important step in understanding the mechanisms of cancer pathogenesis and finding novel biomarkers for cancer therapies. In this study, we constructed a miRNA-cancer association network (miCancerna) based on more than 1,000 miRNA-cancer associations detected from millions of abstracts with a text-mining method, covering 226 miRNA families and 20 common cancers. We further prioritized cancer-related miRNAs at the network level with the random walk algorithm, achieving higher performance than previous miRNA-disease networks. Finally, we examined the top 5 candidate miRNAs for each kind of cancer and found that 71% of them have been confirmed experimentally. miCancerna provides an alternative resource for identifying cancer-related miRNAs.


Introduction
MicroRNAs (miRNAs) are a large class of small noncoding RNAs [1] known to be functionally involved in a wide range of biological processes including embryo development, cell growth, differentiation, apoptosis, and proliferation [2][3][4][5].
Recently, it has been found that miRNAs play important roles in human tumorigenesis, and many of them have been applied as novel biomarkers for cancer therapies [6][7][8][9][10][11], which has attracted growing efforts to reveal the complex associations between miRNAs and cancers. However, the existing literature usually focuses on the relationship between a few miRNAs and a specific cancer, leaving the comprehensive miRNA-cancer network unrevealed. Therefore, fully uncovering the associations between miRNAs and cancers would be valuable for identifying cancer-related miRNAs and understanding the underlying mechanisms.
To this end, the manually curated miRNA-disease association databases HMDD [12] and miR2Disease [13] have been established. These manually created miRNA-disease networks have been used to predict disease-related miRNAs [14][15][16] with relatively high accuracy, opening the opportunity to prioritize miRNAs with bioinformatics methods.
However, thousands of papers on miRNAs and cancer are published each year, making it difficult to check papers manually. Automatic text-mining methods are therefore needed to extract reliable miRNA-disease associations [17] from the growing literature.
In this paper, we collected 1,018 associations between 226 miRNA families and 20 common cancers by mining more than 7.1 million publications with an automatic text-mining method. All these relationships have been recorded in a database named miCancerna, which can be freely accessed at http://micancerna.appspot.com/. We further constructed a general view of the miRNA-cancer network from the top 5% most significant associations to visualize the roles of miRNAs in different cancers, and we prioritized the cancer-related miRNAs using the random walk with restart algorithm (RWRA) [14] on the miRNA-cancer network built from the data in miCancerna. By analyzing the top 5 associated miRNAs of the 20 cancers according to Fisher's exact tests, we found experimental evidence for 71% of these miRNA-cancer relationships; the rest may be candidate cancer-related miRNAs for further experimental validation. The constructed miRNA-cancer network would be valuable for comprehensively understanding the mechanisms of cancers and identifying cancer-related miRNA genes.

Collecting Resource Literature.
We collected abstracts from NCBI's MEDLINE database as our target literature resource. MEDLINE is a comprehensive database containing the abstracts of millions of articles in the biomedical field. Since the full text of a large number of papers is not accessible through PubMed, we considered only the abstracts, which are always available.
In 2000, Reinhart et al. [18] identified the second miRNA, and thereafter researchers began to pay attention to the importance of miRNAs. Therefore, we focus on papers published in 2000 or later. In total, 7,207,066 abstracts were retrieved and then screened with keywords such as "Humans" or "Animals" within the PubMed search to eliminate plant and virus miRNAs from the subsequent text-mining analysis. This filtering yielded 5,606,308 abstracts.
The 20 most common cancers currently reported by the National Cancer Institute (http://www.cancer.gov/) are considered in our study: leukemia, lung cancer, bladder cancer, brain cancer, breast cancer, cervix cancer, colorectal cancer, esophageal cancer, kidney cancer, liver cancer, melanoma, myeloma, non-Hodgkin lymphoma, oral cancer, ovarian cancer, pancreatic cancer, prostate cancer, stomach cancer, thyroid cancer, and uterine cancer. The abstracts were individually marked with cancer types by the following steps. First, we mapped each cancer type to its corresponding MeSH (Medical Subject Headings) term(s) and compiled a list of standard names for each type of cancer. MeSH is the U.S. National Library of Medicine's controlled vocabulary, whose terms are manually assigned to articles archived in MEDLINE to describe their subject matter. Subsequently, we searched the MeSH annotations of each abstract. The abstracts with MeSH terms in our cancer name list were marked with the corresponding cancer and selected for the subsequent text-mining processing.
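The marking step above can be sketched as a lookup from MeSH annotations to cancer types. This is only an illustration: the mapping `CANCER_MESH` lists three cancers with one MeSH heading each as an assumed example, the function name `tag_cancers` is hypothetical, and the authors' actual term list is not given in the text.

```python
# Illustrative (assumed) mapping from cancer type to MeSH headings.
# The study's full list for all 20 cancers is not reproduced here.
CANCER_MESH = {
    "leukemia": {"Leukemia"},
    "lung cancer": {"Lung Neoplasms"},
    "breast cancer": {"Breast Neoplasms"},
}


def tag_cancers(mesh_headings):
    """Return the cancer types whose MeSH terms appear in an abstract's annotations."""
    headings = {h.lower() for h in mesh_headings}
    return sorted(
        cancer
        for cancer, terms in CANCER_MESH.items()
        if headings & {t.lower() for t in terms}
    )


# Hypothetical annotation list for one abstract.
print(tag_cancers(["Humans", "MicroRNAs", "Breast Neoplasms"]))
```

Abstracts for which the function returns an empty list would be dropped from the later co-occurrence analysis.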

Establishing miRNA-Cancer Networks by Text-Mining Method.
With the selected abstracts, we first established relationships between miRNAs and cancers by a text-mining method. The associations between miRNAs and cancers were estimated based on the co-occurrence assumption, a fundamental assumption in text mining that is used to infer whether two terms are associated. In our case, if a particular miRNA frequently appears in the abstracts marked with a specific cancer, we can reasonably assume that the two tend to be related. To establish the associations between miRNAs and cancers, we detected the appearance of miRNAs in the abstracts marked with cancer types. In this study, regular expressions were applied to match miRNA names against the texts as follows. (1) miRNAs (such as "miR-1" and "miR-2") were first extracted from the abstracts following the nomenclature of a "miR" prefix accompanied by a unique identifying number [19]. (2) Following the conventions, a species/state prefix can be added (e.g., "hsa-miR-1" in Homo sapiens and "pre-miR-1" for a precursor), and additional suffixes can be given to indicate loci or variants (e.g., "miR-1a-1") [20]. (3) The regular expression was also designed to match miRNA names that do not use the "miR" prefix, such as "lin-4" and "let-7." (4) Abbreviations covering more than one miRNA are also recognized, for example, "miR-221/222" and "miR-15 & -16." The significance levels of the associations between the miRNAs and the cancers extracted from the marked abstracts were estimated by one-sided Fisher's exact tests [21].
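The four matching rules listed above can be approximated with a single pattern. The sketch below is an assumption about how such a matcher might look, not the expression used in the study; the names `MIRNA_PATTERN` and `extract_mirna_families` are hypothetical, and the pattern collapses variants such as "miR-1a-1" to the family "miR-1".

```python
import re

# Hypothetical pattern covering rules (1)-(4); not the authors' actual expression.
MIRNA_PATTERN = re.compile(
    r"(?<![A-Za-z0-9])"                      # do not start inside a longer token
    r"(?:[a-z]{3}-)?"                        # optional species prefix, e.g. "hsa-"
    r"(?:pre-)?"                             # optional precursor prefix
    r"(?P<stem>miR|let|lin)"                 # "miR" plus the older "let"/"lin" stems
    r"-(?P<num>\d+)"                         # identifying number
    r"(?:[a-z]?(?:-\d+)?)"                   # optional variant letter and locus suffix
    r"(?P<more>(?:\s*[/&]\s*-?\d+[a-z]?)*)",  # abbreviations: "/222", "& -16"
    re.IGNORECASE,
)


def extract_mirna_families(text):
    """Return the set of miRNA family names (e.g. "miR-221") mentioned in text."""
    families = set()
    for match in MIRNA_PATTERN.finditer(text):
        stem = match.group("stem").lower()
        stem = "miR" if stem == "mir" else stem
        families.add(f"{stem}-{match.group('num')}")
        # Expand abbreviated lists such as "miR-221/222" or "miR-15 & -16".
        for number in re.findall(r"\d+", match.group("more")):
            families.add(f"{stem}-{number}")
    return families


sample = "hsa-miR-1 and pre-miR-1 were altered; miR-221/222 and miR-15 & -16 with let-7a and lin-4."
print(sorted(extract_mirna_families(sample)))
```

The sketch deliberately accepts only "/" and "&" as list separators, because treating "and" as a separator would turn phrases like "miR-21 and 5 patients" into false matches.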
For a pair consisting of miRNA $i$ and cancer $j$, the $p$ value of Fisher's exact test is calculated based on the hypergeometric distribution, as follows:

$$p = \sum_{k=a}^{\min(a+b,\,a+c)} \frac{\binom{a+b}{k}\binom{c+d}{a+c-k}}{\binom{N}{a+c}},$$

where $N = a + b + c + d$ denotes the total number of papers included in the text-mining analysis, $a$ stands for the number of papers with both the miRNA and the cancer in the abstracts, $b$ and $c$ represent, respectively, the numbers of abstracts containing one term and excluding the other, and $d$ is the number of papers with neither of the terms. The top 5% of miRNA-cancer associations with the smallest $p$ values are considered significant and were used to generate the general view of the miRNA-cancer network. The miRNA-cancer network is a bipartite network composed of miRNA nodes and cancer nodes. Each edge in miCancerna connects a miRNA and one of its corresponding cancers.
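As a minimal sketch of the significance test, the function below computes a one-sided (enrichment) Fisher's exact test from the four cells of a 2x2 co-occurrence table, under the assumption that a is the number of abstracts with both terms, b the number with the miRNA only, c the number with the cancer only, and d the number with neither. The function name `fisher_one_sided` is hypothetical.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """Upper-tail hypergeometric probability P(X >= a) for a 2x2 table.

    a: abstracts mentioning both the miRNA and the cancer
    b: abstracts mentioning the miRNA only
    c: abstracts mentioning the cancer only
    d: abstracts mentioning neither
    """
    n = a + b + c + d
    mirna_total = a + b    # all abstracts with the miRNA
    cancer_total = a + c   # all abstracts with the cancer
    denominator = comb(n, cancer_total)
    tail = sum(
        comb(mirna_total, k) * comb(n - mirna_total, cancer_total - k)
        for k in range(a, min(mirna_total, cancer_total) + 1)
    )
    return tail / denominator


# Toy counts, not data from the study.
print(fisher_one_sided(3, 1, 1, 3))
```

Exact integer binomials keep the result accurate, but for tables with millions of abstracts a log-space or library implementation would likely be faster.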

Text-Mining Quality Check.
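We first queried PubMed with "MIR or MIRN or MIRNA or MICRORNA" and randomly picked 100 MEDLINE abstracts with at least one miRNA identifier from the query result as our evaluation data. We then investigated the reliability of detecting miRNAs in texts using the $F$-measure, which is the harmonic mean of two other measures, recall and precision, as follows:

$$\text{precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad \text{recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad F = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}},$$

where TP, FP, and FN are the numbers of true positives, false positives, and false negatives, respectively.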
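For illustration, the three measures can be computed directly from the counts. The helper name `precision_recall_f` and the counts in the example are hypothetical and are not results from the evaluation.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-measure) from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Made-up counts for demonstration only.
print(precision_recall_f(tp=90, fp=10, fn=30))
```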

Random Walk with Restart Method.
Based on the network constructed from the data in miCancerna, a random walk with restart algorithm (RWRA) is applied to prioritize cancer-related miRNAs. RWRA is one of the random walk models widely used in disease gene discovery [22]. It simulates a random walker's moves in a given network: at each step, the walker either moves from the current node to a direct neighboring node or restarts at a training node with probability $\alpha$. The movement in RWRA is defined as follows:

$$p_{t+1} = (1 - \alpha) W p_t + \alpha p_0,$$

where $W$ is a column-normalized adjacency matrix representing the given network. In this case, each nonzero entry in $W$ stands for an association between a miRNA and a cancer, and the corresponding nodes are taken as seeds. $p_t$ is a vector representing the probabilities of the walker being at each node at time $t$, and $p_0$ is the initial probability vector, in which training nodes are equally assigned $1/n$ ($n$ is the number of seeds) while the others are 0. The process is iterated until $p_t$ reaches a stable state, that is, when the difference between $p_{t+1}$ and $p_t$ (measured by the $L_1$ norm) is less than a threshold value ($10^{-6}$ in this study). The stable probability is denoted $p_\infty$. The candidate nodes are then ranked in descending order according to $p_\infty$.
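The iteration described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the restart probability is written `alpha`, the network is given as a dense symmetric adjacency matrix, and the four-node toy graph and the function names are hypothetical rather than taken from miCancerna.

```python
def column_normalize(adj):
    """Divide every column of a square matrix by its sum (zero columns stay zero)."""
    n = len(adj)
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    return [
        [adj[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


def random_walk_with_restart(adj, seeds, alpha=0.9, tol=1e-6, max_iter=10000):
    """Iterate p <- (1 - alpha) * W p + alpha * p0 until the L1 change is below tol."""
    n = len(adj)
    w = column_normalize(adj)
    p0 = [0.0] * n
    for s in seeds:
        p0[s] = 1.0 / len(seeds)   # seeds share the initial probability equally
    p = p0[:]
    for _ in range(max_iter):
        nxt = [
            (1 - alpha) * sum(w[i][j] * p[j] for j in range(n)) + alpha * p0[i]
            for i in range(n)
        ]
        change = sum(abs(x - y) for x, y in zip(nxt, p))
        p = nxt
        if change < tol:
            break
    return p


# Toy bipartite graph: nodes 0-1 are miRNAs (A, B), nodes 2-3 are cancers (X, Y).
# Edges: A-X, B-X, B-Y.
adj = [
    [0, 0, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
]
scores = random_walk_with_restart(adj, seeds=[0], alpha=0.5)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking, scores)
```

For a network of realistic size, a sparse matrix representation would be the more practical choice; the dense lists here only keep the example dependency-free.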

Leave-One-Out Cross-Validation.
The performance of cancer-related miRNA prioritization by RWRA on miCancerna was evaluated by calculating the area under the ROC curve (AUC) through leave-one-out cross-validation. For each training node, we took it as a candidate node, randomly picked 20 miRNAs not associated with the same cancer as testing nodes, and then prioritized them as above. For each threshold, the sensitivity (SN) and specificity (SP) are defined as follows:

$$\mathrm{SN} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad \mathrm{SP} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}},$$

where TP (true positive) is the number of training nodes ranked above the threshold, FN (false negative) is the number of training nodes ranked below the threshold, TN (true negative) is the number of testing nodes ranked below the threshold, and FP (false positive) is the number of testing nodes ranked above the threshold. The ROC curve shows the relationship between SN and 1 − SP, and the AUC is the area under the ROC curve.
According to these comparison results, we concluded that miCancerna is a high-quality resource of miRNA-cancer associations.
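The sensitivity, specificity, and AUC definitions above can be turned into a small rank-based calculation. The sketch assumes that rank 1 is the best rank, that "above the threshold" means a rank number no larger than the threshold, and that the area is obtained with the trapezoidal rule; the function names and the toy ranks are hypothetical.

```python
def sensitivity_specificity(held_out_ranks, test_ranks, threshold):
    """SN and SP at one rank threshold (a rank <= threshold counts as 'above')."""
    tp = sum(r <= threshold for r in held_out_ranks)
    fn = len(held_out_ranks) - tp
    fp = sum(r <= threshold for r in test_ranks)
    tn = len(test_ranks) - fp
    return tp / (tp + fn), tn / (tn + fp)


def auc_from_ranks(held_out_ranks, test_ranks, max_rank):
    """Area under the ROC curve, sweeping the threshold over ranks 1..max_rank."""
    points = [(0.0, 0.0)]
    for threshold in range(1, max_rank + 1):
        sn, sp = sensitivity_specificity(held_out_ranks, test_ranks, threshold)
        points.append((1 - sp, sn))
    # Trapezoidal rule over consecutive (1 - SP, SN) points.
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy example: two held-out seeds, each ranked first among its own test miRNAs.
print(auc_from_ranks([1, 1], [2, 3, 2, 3], max_rank=3))
```

In the setting described in the text, each cross-validation round would contribute one held-out rank and 20 test ranks, so `max_rank` would be 21.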

miRNA-Cancer Network Visualization.
To reveal the roles of miRNAs in different cancers, we constructed a bipartite network from the top 5% of associations ranked by Fisher's exact test $p$ values in miCancerna, consisting of 40 miRNA families and 13 types of cancers (Figure 1). In this bipartite network, miRNAs are only connected to cancers and cancers are only connected to miRNAs. The miRNA-cancer network was visualized with Pajek (http://vlado.fmf.uni-lj.si/pub/networks/pajek/). Interestingly, almost all of these cancers (except stomach cancer) can be connected via miRNAs, which indicates that different cancers might share common pathogenic components regulated by these interconnected miRNAs, while stomach cancer may differ from the others.
As shown in Figure 1, miRNAs may have different involvements in cancers. Some miRNAs are associated with one specific cancer. For example, miR-15 and miR-16 tend to be related to leukemia, and miR-122 is almost exclusively associated with liver cancer. These miRNAs may be used as biomarker candidates for the diagnosis of the corresponding cancers and for assessing the efficacy of their therapies. By contrast, some miRNAs tend to be associated with various cancers. One example is miR-21, which is significantly associated with breast cancer, colorectal cancer, liver cancer, and pancreatic cancer, indicating that target genes of miR-21 might play critical roles in tumor formation.
It is interesting that four of the top 10 miRNA-cancer associations (Table 1) are miRNA-leukemia associations, and 28.6% (12) of the significant associations are related to leukemia, which makes leukemia the most miRNA-related cancer. Similarly, we found that miR-21 is the most cancer-related miRNA: it is associated with 4 (30.77%) different cancers in the significant associations (breast cancer, pancreatic cancer, liver cancer, and colorectal cancer), indicating that miR-21 may be involved in an important pathway in cancer formation.

Prioritization of Cancer-Related miRNAs.
We applied RWRA to the network established from miCancerna to prioritize candidate cancer-related miRNAs, and the performance was evaluated by leave-one-out cross-validation. With a restart probability alpha of 0.9, the AUC of the ROC curve reaches 0.798 (Figure 2), where an AUC of 1 stands for perfect performance and an AUC of 0.5 indicates random performance. The performances with different restart probabilities are shown in Table 2. The AUC improves as alpha increases, but the variation is small. To rule out the possibility that the performance of miCancerna is achieved by chance, a permutation test with 300 runs was performed. For each run, the seeds were randomly selected from the candidate nodes. The average AUC of the random permutations obtained by leave-one-out cross-validation is 0.513, and the distribution of the random permutation AUCs is shown in Figure 3. There is a clear difference between the AUC achieved by miCancerna and those of the random permutations, which supports the conclusion that miCancerna reveals the real involvement of miRNAs in cancer biology. The top 5 potential miRNAs of each cancer are presented in Table 3, among which 71% are supported by experimental evidence in dbDEMC [23] or in literature published after miCancerna was built. The performance of cancer-related miRNA prioritization demonstrates the reliability of miCancerna. Moreover, the top predicted miRNAs may be potential cancer-related miRNAs for further study.

Comparison with Similar Databases.
We compared miCancerna with similar databases and networks. First, we compared miCancerna with the manually curated database miR2Disease in terms of the number of evidence papers. For most cancers, miCancerna provides many more evidence papers than miR2Disease (Table 4). Second, we compared the prediction performance of RWRA on miCancerna with that on the miRNA-cancer network used in RWRMDA [14], which was built from HMDD, a manually curated database. The ROC curves for both networks are shown in Figure 2. According to the results of leave-one-out cross-validation, the network used in RWRMDA achieved an AUC of 0.763, which is lower than the 0.797 achieved by miCancerna.
These results indicate that miCancerna provides an alternative resource of miRNA-cancer associations.

Conclusion
In this study, we constructed a reliable miRNA-cancer network based on a text-mining method, which is stored in the database miCancerna. The current release contains 1,018 associations between 226 miRNA families and 20 common cancers. According to our test results, miCancerna provides a reliable and comprehensive resource of miRNA-cancer associations, which can be further used in the identification of cancer-related miRNAs.
For future development, we plan to consider more types of cancers, add regulation information to the miRNA-cancer associations, and integrate miCancerna into other related databases, such as MISIM [24], the human miRNA functional similarity and functional network.