MicroRNAs constitute an important class of noncoding, single-stranded RNA molecules, ~22 nucleotides long, encoded by endogenous genes. They play an important role in regulating gene transcription and normal development. MicroRNAs can be associated with disease; however, only a few microRNA-disease associations have been confirmed by traditional experimental approaches. We introduce two methods to predict microRNA-disease associations. The first method, KATZ, integrates a social network analysis method with machine learning and is based on networks derived from known microRNA-disease associations, disease-disease associations, and microRNA-microRNA associations. The other method, CATAPULT, is a supervised machine learning method. We applied the two methods to 242 known microRNA-disease associations and evaluated their performance using leave-one-out cross-validation and 3-fold cross-validation. Experiments showed that our methods outperformed the state-of-the-art methods.
MicroRNAs constitute a class of non-protein-coding small RNAs, 20 to 25 nucleotides long, that bind to the 3′ untranslated region of target mRNAs to regulate mRNA turnover and translation. MicroRNAs regulate many biological processes, such as development, differentiation, apoptosis, and disease [
The importance of microRNA-disease association prediction has been appreciated for some time [
To overcome the above challenges and to effectively predict associations, we explored the computational method KATZ [
We used three types of data: microRNA-microRNA association, microRNA-disease association, and disease-disease association data. The microRNA-microRNA association dataset includes 271 microRNAs, and each association is denoted by a functional similarity score. The dataset was downloaded from
Distribution of the three datasets.
Dataset | Matrix | Similarity score >0 |
---|---|---|
MicroRNA-microRNA association dataset | | 56289 |
Disease-disease association dataset | | 20285172 |
MicroRNA-disease association dataset | | 242 |
With the above datasets, we could construct a microRNA-microRNA network, a disease-disease network, and a microRNA-disease network using a bipartite graph. For example, Figure
Bipartite graph of the microRNA-disease association network.
The degree distributions of microRNAs and diseases in the bipartite graph of the microRNA-disease association network are illustrated in Figure
Degree distributions of microRNAs and diseases in the bipartite graph of the microRNA-disease association network.
We compare our methods with the previously described microRNA-based similarity inference (MBSI), phenotype-based similarity inference (PBSI), and network-consistency-based inference (NetCBI) methods [
Statistical data for the bipartite graph of the microRNA-disease association network.
Statistic | Value |
---|---|
MicroRNAs | 271 |
Diseases | 5080 |
MicroRNAs with known associations | 99 |
Diseases with known associations | 51 |
Known associations | 242 |
Average degree of microRNAs with known associations | 2.44 |
Average degree of diseases with known associations | 4.75 |
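The two average degrees follow directly from the counts in the table: in a bipartite graph every edge touches exactly one node on each side, so the average degree on a side is the number of edges divided by the number of nodes on that side. A short check in Python (the variable names are illustrative):

```python
# Counts taken from the statistics table for the bipartite graph.
known_associations = 242      # edges
associated_micrornas = 99     # microRNAs with at least one known association
associated_diseases = 51      # diseases with at least one known association

# Each edge contributes one unit of degree to each side of the bipartite graph.
avg_microrna_degree = known_associations / associated_micrornas
avg_disease_degree = known_associations / associated_diseases

print(round(avg_microrna_degree, 2))  # 2.44
print(round(avg_disease_degree, 2))   # 4.75
```

Note that the averages are taken over the 99 microRNAs and 51 diseases that have at least one known association, not over all 271 microRNAs and 5080 diseases.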
We introduce two different computational methods, which were presented by [
KATZ is similar to classical approaches, such as random walk [
For the known associations between microRNAs and diseases, we constructed an unweighted, undirected graph and derived the corresponding adjacency matrix. To illustrate the method, we show a simple unweighted, undirected graph in Figure
Unweighted, undirected graph.
Next, we integrated walks of different lengths to obtain a comprehensive association measure. We introduced a nonnegative coefficient
From formula (
To use KATZ, we need a microRNA-disease association adjacency matrix
Setting
One of the advantages of KATZ is that it can be used to study microRNA-disease associations in humans and in other species. In KATZ, this is achieved simply by changing the submatrix of adjacency matrix
CATAPULT is a supervised learning method. General supervised learning methods need both positive and negative examples. However, for microRNA-disease association, negative examples are lacking. Positive associations can be verified by existing methods, but no method exists to prove a negative association. Because negative associations are seldom proven, we treated all node pairs without a known positive association as unlabeled; previous studies have shown that most unlabeled pairs have a negative association [
A study by Mordelet and Vert [
INIT
For
Select the set
Train a classifier based on positive examples
For any
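The procedure above survives only in fragments here, so the following Python is a hedged sketch of the general bagging idea for learning from positive and unlabeled examples, not a reproduction of CATAPULT: in each round a random subset of unlabeled examples is treated as negative, a classifier is trained against the positives, and the unlabeled examples left out of that round are scored. A simple centroid-based scorer stands in for the biased SVM, and all names, parameters, and data (`pu_bagging_scores`, `n_rounds`, `sample_size`, the toy feature vectors) are assumptions made for illustration.

```python
import random


def centroid(points):
    """Mean vector of a non-empty list of equal-length feature vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def train_centroid_scorer(positives, negatives):
    """Stand-in base learner (not an SVM): project onto the direction from the
    negative centroid to the positive centroid, centered at their midpoint."""
    pos_c, neg_c = centroid(positives), centroid(negatives)
    direction = [p - n for p, n in zip(pos_c, neg_c)]
    midpoint = [(p + n) / 2 for p, n in zip(pos_c, neg_c)]
    return lambda x: sum(d * (xi - m) for d, xi, m in zip(direction, x, midpoint))


def pu_bagging_scores(positives, unlabeled, n_rounds=50, sample_size=None, seed=0):
    """Average out-of-bag score for every unlabeled example."""
    rng = random.Random(seed)
    sample_size = sample_size or len(positives)
    totals = [0.0] * len(unlabeled)
    counts = [0] * len(unlabeled)
    for _ in range(n_rounds):
        # Draw indices with replacement; duplicates collapse, so the bag holds
        # at most sample_size distinct unlabeled examples treated as negatives.
        drawn = {rng.randrange(len(unlabeled)) for _ in range(sample_size)}
        scorer = train_centroid_scorer(positives, [unlabeled[i] for i in drawn])
        # Score only the unlabeled examples that were left out of this round.
        for i, x in enumerate(unlabeled):
            if i not in drawn:
                totals[i] += scorer(x)
                counts[i] += 1
    # Examples that were never out-of-bag keep a neutral score of 0.0.
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


# Hypothetical toy data: the first unlabeled point resembles the positives.
positives = [[1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
unlabeled = [[1.0, 0.95], [0.0, 0.0], [0.1, -0.1], [-0.1, 0.1], [0.05, 0.0], [0.0, 0.1]]
scores = pu_bagging_scores(positives, unlabeled)
best = max(range(len(unlabeled)), key=scores.__getitem__)
print(best, scores[best])
```

Averaging over many random bags reduces the damage done by any single bag that happens to contain hidden positives, which is why the unlabeled pairs can be used as provisional negatives at all.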
The KATZ and CATAPULT methods were applied to the 242 known microRNA-disease associations to infer potential microRNA-disease associations. First, we verified microRNA-disease associations. The set of 242 known microRNA-disease associations is regarded as the “gold standard” data and was used both to evaluate the performance of the KATZ and CATAPULT methods in the leave-one-out and 3-fold cross-validation experiments and as the training dataset in the comprehensive prediction [
We used a receiver operating characteristic (ROC) curve to evaluate the performance of each method. Varying the decision threshold traces out the ROC curve, and the area under the curve (AUC) summarizes it as a single number. When the ROC curves alone did not show which method was best, we compared the AUC values. In the leave-one-out cross-validation experiment, KATZ and CATAPULT were tested on the 242 known microRNA-disease associations and achieved AUC values of 98.9% and 98.8%, respectively. Figure
ROC curves of KATZ and CATAPULT methods by leave-one-out cross-validation.
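To make the relation between the threshold sweep and the AUC concrete, the following Python is a minimal sketch that builds ROC points by lowering the threshold past each distinct score and then integrates them with the trapezoidal rule. The function names and the five toy scores are illustrative assumptions; they are not the evaluation code or data used in the experiments.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points obtained by
    lowering the threshold past each distinct score. Labels are 1 or 0."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Examples with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical scores: 1 marks a known association, 0 an unlabeled pair.
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
labels = [1, 0, 1, 0, 0]
points = roc_points(scores, labels)
print(area_under_curve(points))  # 5 of the 6 positive-negative pairs are ordered correctly
```

The resulting value equals the fraction of positive-negative pairs in which the positive receives the higher score, which is why the AUC can be read as a ranking quality measure.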
For the leave-one-out cross-validation, we ran one iteration for each known microRNA-disease association. In each iteration, we hid one known microRNA-disease association and ran the KATZ and CATAPULT methods on the remaining associations, repeating this 242 times so that each known association was hidden exactly once. In each iteration, we ranked the 5080 diseases for the microRNA in the hidden association. We rule that if the disease in the hidden association has the highest
Distribution of diseases on the basis of microRNAs.
Number of microRNAs | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 8 | 9 | 10 | 12 | 15 | 20 | 24 | 27 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Number of diseases | 5029 | 16 | 10 | 6 | 3 | 3 | 2 | 4 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
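The distribution in this table can be checked against the dataset statistics reported earlier: the disease counts should sum to 5080, the diseases with at least one microRNA should number 51, and the degree-weighted sum should give the 242 known associations. A short consistency check in Python (the dictionary name is illustrative):

```python
# Number of associated microRNAs -> number of diseases, copied from the table.
degree_to_count = {
    0: 5029, 1: 16, 2: 10, 3: 6, 4: 3, 5: 3, 6: 2, 8: 4,
    9: 1, 10: 1, 12: 1, 15: 1, 20: 1, 24: 1, 27: 1,
}

total_diseases = sum(degree_to_count.values())
associated_diseases = total_diseases - degree_to_count[0]
total_associations = sum(degree * count for degree, count in degree_to_count.items())

print(total_diseases)       # 5080
print(associated_diseases)  # 51
print(total_associations)   # 242
```

The check also shows how sparse the data are: 5029 of the 5080 diseases have no known microRNA association.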
In the 3-fold cross-validation experiment, KATZ and CATAPULT were tested on the 242 known microRNA-disease associations and achieved AUC values of 98.4% and 98.3%, respectively. Figure
ROC curves of KATZ and CATAPULT methods by 3-fold cross-validation.
Recovery of microRNA-disease associations with respect to disease rank under leave-one-out cross-validation.
To confirm the strength of our methods, we compared them with MBSI, PBSI, and NetCBI. MBSI and PBSI both work on the basis of recommendation. MBSI takes full advantage of microRNA similarity: if an association between a microRNA and a disease has been validated, then other similar microRNAs are recommended for that disease. The drawback of MBSI is that it overlooks disease-disease associations. In contrast, PBSI takes full advantage of disease similarity but overlooks microRNA-microRNA associations. NetCBI considers both types of association. The basic idea of NetCBI is ranking. Suppose a microRNA and a disease are linked; if a microRNA is ranked at the top when querying the microRNAs and a disease is ranked at the top when querying the diseases, then NetCBI infers that associations exist between the top-ranking microRNAs and the top-ranking diseases.
We used leave-one-out cross-validation to compare our methods with previous methods based on the same datasets. Table
Comparison of different prediction methods based on AUC values.
Method | MBSI | PBSI | NetCBI | KATZ | CATAPULT |
---|---|---|---|---|---|
AUC | 74.83% | 54.02% | 80.66% | 98.9% | 98.8% |
We verified the top 10 predicted associations that were absent from our microRNA-disease association dataset; the latest online databases provide supporting evidence for them. The online databases that we referenced were OMIM, HMDD, and miR2Disease. Tables
Top 10 newly predicted microRNA-disease associations by KATZ.
Rank | MicroRNA | OMIM disease ID | Disease | Source |
---|---|---|---|---|
1 | hsa-let-7i | 211980 | Lung cancer | HMDD |
2 | hsa-let-7d | 114480 | Breast cancer | HMDD |
3 | hsa-mir-145 | 211980 | Lung cancer | HMDD |
4 | hsa-mir-18a | 114480 | Breast cancer | HMDD |
5 | hsa-mir-145 | 114480 | Breast cancer | HMDD |
6 | hsa-mir-106b | 114480 | Breast cancer | HMDD |
7 | hsa-let-7e | 114480 | Breast cancer | HMDD |
8 | hsa-let-7b | 114480 | Breast cancer | HMDD |
9 | hsa-mir-19a | 114480 | Breast cancer | HMDD |
10 | hsa-mir-125a | 114480 | Breast cancer | HMDD |
Top 10 newly predicted microRNA-disease associations by CATAPULT.
Rank | MicroRNA | OMIM disease ID | Disease | Source |
---|---|---|---|---|
1 | hsa-let-7a | 176807 | Prostate cancer | miR2Disease |
2 | hsa-mir-34a | 114480 | Breast cancer | HMDD |
3 | hsa-mir-21 | 211980 | Lung cancer | HMDD |
4 | hsa-let-7c | 114480 | Breast cancer | HMDD |
5 | hsa-mir-19a | 114480 | Breast cancer | HMDD |
6 | hsa-let-7a | 151400 | Chronic lymphocytic leukemia | miR2Disease |
7 | hsa-mir-29b | 114480 | Breast cancer | miR2Disease |
8 | hsa-mir-146a | 211980 | Lung cancer | HMDD |
9 | hsa-mir-155 | 211980 | Lung cancer | HMDD |
10 | hsa-let-7c | 114550 | Hepatocellular carcinoma | miR2Disease |
Identifying microRNA-disease associations is an important part of understanding disease mechanisms. Although experimental methods can identify microRNA-disease associations, they are time-consuming and expensive. Hence, efficient methods to identify microRNA-disease associations are desired.
We introduce the KATZ and CATAPULT methods for predicting microRNA-disease associations. KATZ applies a link prediction measure from social network analysis, which is a different strategy from other methods, such as PBSI and MBSI. The KATZ method uses the entire heterogeneous network, including the microRNA-microRNA association, microRNA-disease association, and disease-disease association networks. CATAPULT is a supervised learning method and uses a biased SVM. KATZ and CATAPULT significantly outperform other microRNA-disease association prediction methods, as assessed by leave-one-out and 3-fold cross-validation. The potential microRNA-disease associations predicted by KATZ and CATAPULT will facilitate biological experiments that identify the true associations between microRNAs and diseases. KATZ uses a simple measure on the heterogeneous network to predict potential microRNA-disease associations. Its performance is relatively poor when the known associations are sparse.
Although our methods perform well, better methods could still be proposed to predict microRNA-disease associations. Many features of microRNAs and diseases, such as gene ontology and the external manifestations of disease, have not yet been used to help predict microRNA-disease associations. With the use of more factors in prediction methods and the emergence of new relevant data, the prediction of microRNA-disease associations will advance further. Ultimately, this will help the medical treatment of disease.
The authors declare that they have no conflict of interests.
Quan Zou analyzed the data and designed and coordinated the project. Jinjin Li created the front-end user interface and developed the web server. Yun Wu and Hua Shi were involved in drafting the paper. Ying Ju gave final approval of the version to be published. Qingqi Hong helped revise the paper and gave helpful suggestions. All authors read and approved the final paper.
The work was supported by the Natural Science Foundation of China (nos. 61370010, 61202011, and 61303004), the Natural Science Foundation of Fujian Province of China (no. 2014J01253, no. 2013J05103), and the Open Fund of Shanghai Key Laboratory of Intelligent Information Processing, China (no. IIPL-2014-004).