miRNA-Disease Association Prediction with Collaborative Matrix Factorization

1 Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai 201804, China
2 School of Information and Computer, Anhui Agricultural University, Changjiang West Road 130, Hefei, Anhui, China
3 Department of Computer Science and Engineering, Inha University, Incheon, Republic of Korea
4 Department of Electronic and Computer Engineering, Brunel University London, Uxbridge UB8 3PH, UK
5 Center for Computational Biology and Bioinformatics, Columbia University, 1130 St. Nicholas Avenue, Room 815, New York, NY 10032, USA


Introduction
MicroRNAs (miRNAs) are a class of short noncoding RNAs (19∼25 nt) that normally regulate gene expression and protein production by targeting messenger RNAs (mRNAs) at the posttranscriptional level [1-9]. Since the first two miRNAs, lin-4 and let-7, were found in 1993 and 2000 [10, 11], thousands of miRNAs have been detected in eukaryotic organisms ranging from nematodes to humans. The latest version of miRBase contains 26845 entries, and more than 2000 miRNAs have been detected in humans [12-14]. With the development of bioinformatics and the progress of miRNA-related projects, research has gradually focused on the function of miRNAs. Existing studies have shown that miRNAs are involved in many important biological processes [15, 16], such as cell differentiation [17], proliferation [18], signal transduction [19], and viral infection [20]. It is therefore not surprising that miRNAs are closely related to various human complex diseases [12, 21-26]. For example, researchers found that mir-433 is upregulated in gastric carcinoma by regulating the expression of GRB2, which is a known tumour-associated protein [27]. When overexpressed, mir-126 can not only suppress the growth of colorectal cancer cells but also help to differentiate between malignant and normal colorectal tissue [28]. In addition, changes in the expression of the mir-17∼92 miRNA cluster are closely related to kidney cyst growth in polycystic kidney disease [29]. Given this close relationship between miRNAs and diseases, uncovering latent miRNA-disease associations would facilitate the diagnosis, prevention, and treatment of human complex diseases [30-33]. However, identifying miRNA-disease associations with experimental methods is expensive and time-consuming. As knowledge about miRNAs accumulates, including prediction models for miRNAs and diseases, the functions of miRNAs in biological processes, and signaling pathways, and as new therapies are urgently needed for the treatment of complex diseases, it is necessary to develop powerful computational methods to reveal potential miRNA-disease associations [12, 15, 20, 34-40].
Previous studies have shown that functionally similar miRNAs tend to be associated with similar diseases; therefore, many computational models have been proposed to identify novel miRNA-disease associations [13, 41-46]. For example, Jiang et al. [31] analyzed and improved a disease-gene prediction model, introduced the principle of the hypergeometric distribution and how to use it, and discussed its application in the prediction model and its actual effect. To realize the prediction function of the improved model, they used different types of datasets, including miRNA functional similarity data, disease phenotype similarity data, and known human disease-miRNA association data. The prediction accuracy of this method is therefore greatly affected by miRNA neighbor information and miRNA-target interaction prediction. Chen et al. [47] reported a method, HGIMDA, that identifies novel miRNA-disease associations by heterogeneous graph inference. This algorithm improves prediction accuracy by integrating known miRNA-disease associations, miRNA functional similarity, disease semantic similarity, and Gaussian interaction profile kernel similarity for diseases and miRNAs. In addition, HGIMDA can be applied to new diseases and new miRNAs that do not have any known association. Li et al. [48] proposed the computational model Matrix Completion for MiRNA-Disease Association prediction (MCMDA) to predict miRNA-disease associations. This model uses only known miRNA-disease associations and still achieved good prediction performance. The limitation of MCMDA is that it cannot be applied to new diseases and new miRNAs that do not have any known association.
You et al. [49] developed the model Path-Based MiRNA-Disease Association Prediction (PBMDA) to predict miRNA-disease associations by integrating known human miRNA-disease associations, miRNA functional similarity, disease semantic similarity, and Gaussian interaction profile kernel similarity for miRNAs and diseases. A depth-first search algorithm was used in this model to identify novel miRNA-disease associations. Benefiting from an effective algorithm and reliable biological datasets, PBMDA showed good prediction performance. Furthermore, Xu et al. [50] introduced an approach to identify disease-related miRNAs using the miRNA target-dysregulated network (MTDN). To distinguish disease-related miRNAs from candidates, they proposed an SVM classifier based on the radial basis function and the LIBSVM package. Research has also shown that miRNAs can functionally interact with environmental factors (EFs) to affect and determine human complex diseases. Chen [51] proposed the model miREFRWR to predict associations between diseases and miRNA-EF interactions. Random walk theory was applied to the miRNA similarity network and the EF similarity network. In addition, drug chemical structure similarity, miRNA functional similarity, and network-based similarity were also used in miREFRWR. Based on these biological datasets and an efficient calculation method, miREFRWR could be an effective tool in computational biology. Moreover, Chen et al. [52] proposed the computational model RKNNMDA to predict potential associations between miRNAs and diseases. Four biological datasets were integrated into RKNNMDA: experimentally verified human miRNA-disease associations, miRNA functional similarity, disease semantic similarity, and Gaussian interaction profile kernel similarity for miRNAs and diseases. RKNNMDA achieved high prediction accuracy and can be applied to new diseases that do not have any known related miRNA information.
Generally speaking, current prediction models for miRNA-disease associations still have some shortcomings. For example, unreliable datasets, such as miRNA-target interactions and disease-gene associations, have a great influence on the accuracy of prediction models. In addition, some existing models cannot make predictions for miRNAs and diseases that do not have any known associations. In other words, a new effective computational model is needed. Based on the assumption that functionally similar miRNAs tend to be associated with similar diseases, we introduce the model of Collaborative Matrix Factorization for MiRNA-Disease Association prediction (CMFMDA) to reveal novel miRNA-disease associations by integrating experimentally validated miRNA-disease associations, miRNA functional similarity information, and disease semantic similarity information. We evaluated CMFMDA in three ways: 5-fold CV, local LOOCV, and global LOOCV. The AUCs under these three schemes are 0.8697, 0.8318, and 0.8841, respectively, which suggests that CMFMDA is a reliable and efficient prediction model. We then used two case studies, Esophageal Neoplasms and Kidney Neoplasms, to evaluate the performance of CMFMDA. For these two important diseases, 42 and 41 of the top 50 predicted miRNA-disease associations, respectively, were confirmed by recent experimental literature. In addition, experiments show that CMFMDA can be applied to diseases and miRNAs without any known association.

Human miRNA-Disease Associations.
We obtained the associations between miRNAs and diseases from HMDD, comprising 5430 experimentally confirmed human miRNA-disease associations between 383 diseases and 495 miRNAs. An adjacency matrix Y is used to describe the associations between miRNAs and diseases: if miRNA m(i) is associated with disease d(j), the entry Y(m(i), d(j)) is 1, and otherwise it is 0. Furthermore, we use two variables, nm and nd, to represent the number of miRNAs and diseases investigated in this paper, respectively.
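As an illustration of the construction just described, the following minimal sketch builds a binary association matrix from a list of known pairs. The function name, the variable name Y, and the toy identifiers are assumptions made for illustration; they are not taken from HMDD or from the authors' code.

```python
import numpy as np

def build_association_matrix(pairs, mirnas, diseases):
    """Return an nm x nd binary matrix with 1 for each known (miRNA, disease) pair."""
    m_index = {name: i for i, name in enumerate(mirnas)}
    d_index = {name: j for j, name in enumerate(diseases)}
    Y = np.zeros((len(mirnas), len(diseases)), dtype=int)
    for mirna, disease in pairs:
        Y[m_index[mirna], d_index[disease]] = 1
    return Y

# Toy data (hypothetical names, not real HMDD entries).
mirnas = ["mir-a", "mir-b", "mir-c"]
diseases = ["disease-x", "disease-y"]
pairs = [("mir-a", "disease-x"), ("mir-c", "disease-y"), ("mir-a", "disease-y")]

Y = build_association_matrix(pairs, mirnas, diseases)
nm, nd = Y.shape  # number of miRNAs and number of diseases
```

With the real data, the same construction would give a 495 x 383 matrix containing 5430 ones.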

MiRNA Functional Similarity.
Based on the assumption that miRNAs with similar functions tend to be involved in similar diseases, Wang et al. [42] presented a method to calculate the miRNA functional similarity score. We downloaded the miRNA functional similarity scores from http://www.cuilab.cn/files/images/cuilab/misim.zip and constructed the matrix SM to represent the miRNA functional similarity network, where the entry SM(m(i), m(j)) represents the functional similarity score between miRNAs m(i) and m(j).

Disease Semantic Similarity.
In this paper, a disease is described as a Directed Acyclic Graph (DAG), and DAG(D) = (D, T(D), E(D)) is used to describe disease D, where T(D) is the node set including all ancestor nodes of D and D itself, and E(D) is the corresponding link set including the direct edges from parent nodes to child nodes. The semantic value of disease D in DAG(D) is defined as follows:
$$DV(D) = \sum_{d \in T(D)} D_D(d), \qquad D_D(d) = \begin{cases} 1 & \text{if } d = D, \\ \max\{\Delta \cdot D_D(d') \mid d' \in \text{children of } d\} & \text{if } d \neq D, \end{cases}$$
where Δ is the semantic contribution factor. For disease D, the contribution of D itself to its semantic value is 1. As the distance between D and another disease term grows, the contribution of that term falls. Therefore, disease terms in the same layer have the same contribution to the semantic value of disease D.
The more two diseases have in common in their DAGs, the larger their semantic similarity. Therefore, the semantic similarity between diseases d(i) and d(j) can be defined as follows:
$$SD(d(i), d(j)) = \frac{\sum_{t \in T(d(i)) \cap T(d(j))} \left( D_{d(i)}(t) + D_{d(j)}(t) \right)}{DV(d(i)) + DV(d(j))},$$
where SD is the disease semantic similarity matrix.
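The DAG-based similarity described above can be sketched in a few lines. This is only an illustration under stated assumptions: the DAG is given as a hypothetical dictionary mapping each term to its parents, the function names are invented for this example, and the default contribution factor delta = 0.5 is an assumed value rather than one stated in this section.

```python
def semantic_contributions(disease, parents, delta=0.5):
    """Contribution of the disease and each of its ancestors to its semantic value.

    The disease itself contributes 1; an ancestor contributes delta times the
    largest contribution among its children inside the disease's DAG.
    """
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        next_frontier = []
        for node in frontier:
            for parent in parents.get(node, ()):
                value = delta * contrib[node]
                if value > contrib.get(parent, 0.0):
                    contrib[parent] = value
                    next_frontier.append(parent)
        frontier = next_frontier
    return contrib

def semantic_similarity(d1, d2, parents, delta=0.5):
    """Shared contributions divided by the sum of the two semantic values."""
    c1 = semantic_contributions(d1, parents, delta)
    c2 = semantic_contributions(d2, parents, delta)
    shared = set(c1) & set(c2)
    numerator = sum(c1[t] + c2[t] for t in shared)
    return numerator / (sum(c1.values()) + sum(c2.values()))

# Toy DAG (hypothetical terms): B and C are both children of A.
parents = {"A": [], "B": ["A"], "C": ["A"]}
sim_bc = semantic_similarity("B", "C", parents)  # only A is shared
sim_bb = semantic_similarity("B", "B", parents)  # identical diseases
```

In the toy DAG, B and C share only the ancestor A, so their similarity is (0.5 + 0.5) / (1.5 + 1.5) = 1/3, while any disease compared with itself scores 1.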

CMFMDA.
In this study, we developed the computational model of Collaborative Matrix Factorization for MiRNA-Disease Association prediction (CMFMDA) to predict novel miRNA-disease associations [53]. The flow of CMFMDA is shown in Figure 1.
In the first step in Figure 1, we obtain the final miRNA similarity matrix SM and disease similarity matrix SD by integrating the miRNA functional similarity network, the disease semantic similarity network, and experimentally verified miRNA-disease associations.
Second, we use WKNKN [54] to estimate the association probability for the unknown miRNA-disease pairs based on their known neighbors.
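A rough sketch of a weighted K-nearest-known-neighbors preprocessing step is given below. It follows the general idea of estimating unknown entries from the most similar miRNAs and diseases that already have known associations; the details of WKNKN in [54] may differ, and the function names, the neighbor count K, and the decay term eta are assumptions chosen for illustration.

```python
import numpy as np

def _neighbor_estimate(Y, S, K, eta):
    """Estimate each row of Y from the rows of its K most similar known neighbors."""
    known = np.flatnonzero(Y.sum(axis=1) > 0)  # rows with at least one association
    est = np.zeros_like(Y, dtype=float)
    for q in range(Y.shape[0]):
        cand = known[known != q]
        if cand.size == 0:
            continue
        order = cand[np.argsort(-S[q, cand], kind="stable")][:K]
        sims = S[q, order]
        z = sims.sum()
        if z <= 0:
            continue
        weights = (eta ** np.arange(order.size)) * sims  # decay with neighbor rank
        est[q] = weights @ Y[order] / z
    return est

def wknkn(Y, SM, SD, K=5, eta=0.7):
    """Fill unknown entries with the average of miRNA-side and disease-side estimates."""
    Y = np.asarray(Y, dtype=float)
    Ym = _neighbor_estimate(Y, SM, K, eta)        # neighbors among miRNAs
    Yd = _neighbor_estimate(Y.T, SD, K, eta).T    # neighbors among diseases
    return np.maximum(Y, (Ym + Yd) / 2.0)         # known associations stay at 1

# Toy example: 2 miRNAs x 2 diseases with a single known association.
Y_toy = np.array([[1, 0], [0, 0]])
SM_toy = np.array([[1.0, 0.8], [0.8, 1.0]])
SD_toy = np.array([[1.0, 0.5], [0.5, 1.0]])
filled = wknkn(Y_toy, SM_toy, SD_toy)
```

Taking the element-wise maximum means that known associations are never lowered; only zero entries can receive a nonzero likelihood.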
Third, Collaborative Matrix Factorization is used to obtain the final prediction matrix. This step contains three parts: (1) For the input association matrix Y, singular value decomposition is used to obtain the initial values of the factor matrices A and B.
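One common way to turn a singular value decomposition into initial factor matrices is to split the leading singular values evenly between the two factors. The sketch below shows that choice; the function name, the rank parameter k, and the even split are assumptions, since the text does not specify them.

```python
import numpy as np

def svd_init(Y, k):
    """Return A (nm x k) and B (nd x k) such that A @ B.T is the rank-k SVD truncation of Y."""
    U, s, Vt = np.linalg.svd(np.asarray(Y, dtype=float), full_matrices=False)
    root = np.sqrt(s[:k])
    A = U[:, :k] * root   # scale the left singular vectors
    B = Vt[:k].T * root   # scale the right singular vectors
    return A, B

# Toy matrix of rank 2, so a rank-2 initialization reproduces it.
Y_small = np.array([[1.0, 0.0, 1.0],
                    [0.0, 1.0, 0.0]])
A_init, B_init = svd_init(Y_small, k=2)
```

When k is smaller than the rank of Y, A @ B.T is the best rank-k approximation of Y in the Frobenius norm, which is a reasonable starting point for an iterative factorization.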

Performance Evaluation.
Based on the known miRNA-disease associations obtained from the HMDD database [55], the predictive performance of CMFMDA was evaluated in two ways: local and global LOOCV. In addition, three computational models, WBSMDA [4], RLSMDA [12], and NCPMDA [56], were compared with CMFMDA. To obtain relevant miRNA information for a chosen disease d, all associations related to disease d were left out, and the remaining associations served as the training set for prediction by CMFMDA. The difference between local LOOCV and global LOOCV is whether all diseases are investigated simultaneously (global) or only the chosen disease is considered (local). Furthermore, Receiver-Operating Characteristic (ROC) curves were used to plot the true positive rate (TPR, sensitivity) against the false positive rate (FPR, 1 − specificity) at different thresholds. Here, sensitivity is the percentage of test miRNA-disease associations ranked higher than the given threshold, and specificity is the percentage of negative miRNA-disease pairs ranked below the threshold. The area under the ROC curve (AUC) was calculated to quantify the prediction performance of CMFMDA: AUC = 1 indicates perfect prediction, and AUC = 0.5 indicates random prediction.
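The quantities used in this evaluation can be computed directly from predicted scores and binary labels. The sketch below is a generic illustration, not the authors' evaluation script; the function names and toy values are assumptions. It computes TPR and FPR at one threshold and the AUC as the fraction of correctly ordered positive-negative pairs, which equals the area under the ROC curve.

```python
import numpy as np

def tpr_fpr(scores, labels, threshold):
    """True and false positive rates when pairs scoring >= threshold are predicted positive."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    predicted = scores >= threshold
    tpr = (predicted & labels).sum() / labels.sum()
    fpr = (predicted & ~labels).sum() / (~labels).sum()
    return float(tpr), float(fpr)

def roc_auc(scores, labels):
    """AUC as the probability that a positive pair outscores a negative pair (ties count 0.5)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

# Toy scores for four candidate pairs; 1 marks a left-out true association.
toy_scores = [0.9, 0.8, 0.3, 0.1]
toy_labels = [1, 0, 1, 0]
auc = roc_auc(toy_scores, toy_labels)
rates = tpr_fpr(toy_scores, toy_labels, threshold=0.5)
```

In the toy example three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75.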
Figure 1: Flowchart of potential miRNA-disease associations prediction based on CMFMDA.
To illustrate the performance of CMFMDA, we compared it with three existing computational models: NCPMDA, RLSMDA, and WBSMDA. The comparison is shown in Figure 2. In global LOOCV, CMFMDA, NCPMDA, RLSMDA, and WBSMDA obtained AUCs of 0.8841, 0.8630, 0.8501, and 0.7799, respectively. In local LOOCV, the four models obtained AUCs of 0.8318, 0.8198, 0.8068, and 0.7213, respectively. Overall, CMFMDA shows both high prediction performance and a better ability to identify novel miRNA-disease associations.

Case Studies.
CMFMDA was applied to all diseases in this paper to predict novel miRNAs associated with each disease. Here, two case studies, Esophageal Neoplasms and Kidney Neoplasms, are presented to demonstrate the prediction performance of CMFMDA. We used two important miRNA-disease association databases, miR2Disease [57] and dbDEMC [58], to validate the prediction results. Only associations that were absent from the HMDD database were used for validation; in other words, the validation datasets do not overlap with the datasets used for prediction.
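The validation protocol above amounts to ranking, for one disease, only those miRNAs without a known association in the training data and then counting how many top-ranked candidates appear in an independent database. A minimal sketch follows; the function names, the toy score matrix, and the toy validation set are hypothetical and stand in for the real prediction matrix and for miR2Disease and dbDEMC.

```python
import numpy as np

def top_candidates(scores, Y, disease_idx, mirna_names, k=50):
    """Rank miRNAs with no known association to the disease by predicted score."""
    column = scores[:, disease_idx]
    unknown = np.flatnonzero(Y[:, disease_idx] == 0)
    ranked = unknown[np.argsort(-column[unknown], kind="stable")][:k]
    return [mirna_names[i] for i in ranked]

def confirmed_count(candidates, validated):
    """Number of candidates found in an independent validation set."""
    return sum(1 for name in candidates if name in validated)

# Toy example with one disease and four miRNAs (hypothetical values).
names = ["mir-a", "mir-b", "mir-c", "mir-d"]
toy_scores = np.array([[0.9], [0.4], [0.7], [0.2]])
toy_Y = np.array([[1], [0], [0], [0]])   # mir-a is already known, so it is excluded
top2 = top_candidates(toy_scores, toy_Y, 0, names, k=2)
hits = confirmed_count(top2, validated={"mir-c"})
```

Excluding the known associations before ranking is what keeps the validation set disjoint from the data used for prediction.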
Esophageal Neoplasms are a serious disease of the digestive system with a high death rate [59-61]. Early diagnosis and treatment are essential for improving patient survival [62, 63]. Here, we used CMFMDA to identify potential miRNAs associated with Esophageal Neoplasms. As a result, 9 of the top 10 and 42 of the top 50 predicted related miRNAs were experimentally confirmed to be associated with Esophageal Neoplasms (see Table 1). For example, mir-133b can inhibit the cell growth and invasion of esophageal squamous cell carcinoma (ESCC) by targeting Fascin homolog 1 [59]. The expression level of mir-335 is an independent prognostic factor in ESCC and might be a valuable biomarker for ESCC [64].
Kidney Neoplasms are a common urologic malignancy whose morbidity and mortality have been shown to rise gradually [65-68]. Renal cell carcinoma (RCC) can be divided into several different types [69-71], including chromophobe RCC (CHRCC), collecting duct carcinoma (CDC), clear cell RCC (CCRCC), and papillary RCC (PRCC). Previous studies have shown that miRNAs play a significant part in Kidney Neoplasms [72-74]. In this paper, CMFMDA was employed to identify potential miRNAs associated with Kidney Neoplasms. As a result, 9 of the top 10 and 41 of the top 50 candidate Kidney Neoplasms-related miRNAs were confirmed by dbDEMC and miR2Disease (see Table 2). For example, the serum level of mir-210 may be used as a novel noninvasive biomarker for the detection of CCRCC [75]. Experimental results demonstrate that mir-9 expression correlates not only with the development of CCRCC but also with the development of metastatic recurrence [76].
The results of cross-validation and the independent case studies show that CMFMDA is able to identify potential miRNA-disease associations. Furthermore, CMFMDA was applied to all diseases in HMDD to predict potential miRNAs (see Supplementary Table 1 in Supplementary Material available online at https://doi.org/10.1155/2017/2498957). We hope that the potential disease-miRNA associations predicted by CMFMDA will be confirmed by further biological experiments.

Discussion
Based on the assumption that functionally similar miRNAs are often associated with similar diseases, we proposed the computational model of Collaborative Matrix Factorization for MiRNA-Disease Association prediction (CMFMDA) to identify potential miRNA-disease associations by integrating miRNA functional similarity, disease semantic similarity, and experimentally verified miRNA-disease associations. We compared CMFMDA with three existing computational models, NCPMDA, RLSMDA, and WBSMDA, and concluded from the AUCs obtained by the four models in both global and local LOOCV that CMFMDA has better prediction performance. There are several reasons for the reliable performance of CMFMDA. First, several types of experimentally confirmed biological datasets are used in CMFMDA, including known miRNA-disease associations, the miRNA functional similarity network, and the disease semantic similarity network, which help improve the prediction performance and reduce variance. Second, CMFMDA works not only for diseases and miRNAs with known associations but also for diseases and miRNAs without any known association. Finally, as a global prediction model, CMFMDA can predict related miRNAs for all diseases at the same time.
Although CMFMDA has better prediction performance, it still has limitations that need to be addressed in the future. First, CMFMDA may be biased toward miRNAs with more known associated diseases. Second, the known miRNA-disease associations with experimental evidence are still insufficient. The prediction performance of CMFMDA could be improved by integrating more reliable biological information [77-86]. Finally, how to extract and integrate information from biological datasets more reasonably should be investigated in the future.

Conclusions
Research has shown that the abnormal expression of miRNAs plays a crucial role in the occurrence and development of human complex diseases. In-depth study and analysis of disease-related miRNAs could help find new biomarkers and therapies and thereby improve the survival rate of patients. Therefore, it is necessary to develop more effective computational models to identify potential miRNA-disease associations. In this paper, we presented the computational model CMFMDA to identify novel miRNA-disease associations. In addition to disease semantic similarity and miRNA functional similarity, CMFMDA uses known miRNA-disease associations to predict miRNA-disease associations. LOOCV was chosen to evaluate the predictive performance of CMFMDA. The results of LOOCV and the case studies show that CMFMDA has better prediction performance than the other models. In other words, as an effective tool, CMFMDA can be used not only to predict potential miRNA-disease associations but also to identify new biomarkers that give new directions for the diagnosis and treatment of human complex diseases.

Figure 2: Performance comparisons between CMFMDA and three state-of-the-art disease-miRNA association prediction models (NCPMDA, RLSMDA, and WBSMDA) in terms of ROC curve and AUC based on local and global LOOCV, respectively. CMFMDA achieved AUCs of 0.8841 and 0.8318 in the global and local LOOCV, respectively, outperforming all the previous classical models.

(2) We use L to represent the objective function and a_i and b_j to represent the ith and jth row vectors of A and B, respectively. Two update rules (one for updating matrix A and one for updating matrix B) were derived by setting ∂L/∂A = 0 and ∂L/∂B = 0. Following alternating least squares, these two update rules are applied alternately until convergence.
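The alternating updates described above can be sketched as follows. The exact objective and update rules are not recoverable from the text here, so this sketch assumes a commonly used collaborative matrix factorization objective, ||Y - A B^T||^2 + lam_l (||A||^2 + ||B||^2) + lam_m ||SM - A A^T||^2 + lam_d ||SD - B B^T||^2, together with the closed-form updates usually paired with it. The function name, the regularization parameters lam_l, lam_m, and lam_d, the fixed iteration count, and the toy data are all assumptions.

```python
import numpy as np

def cmf_als(Y, SM, SD, A, B, lam_l=1.0, lam_m=0.1, lam_d=0.1, n_iter=50):
    """Alternately update the miRNA factors A and the disease factors B."""
    eye = np.eye(A.shape[1])
    for _ in range(n_iter):
        # A <- (Y B + lam_m SM A) (B^T B + lam_l I + lam_m A^T A)^{-1}
        lhs = B.T @ B + lam_l * eye + lam_m * (A.T @ A)
        rhs = Y @ B + lam_m * (SM @ A)
        A = np.linalg.solve(lhs.T, rhs.T).T
        # B <- (Y^T A + lam_d SD B) (A^T A + lam_l I + lam_d B^T B)^{-1}
        lhs = A.T @ A + lam_l * eye + lam_d * (B.T @ B)
        rhs = Y.T @ A + lam_d * (SD @ B)
        B = np.linalg.solve(lhs.T, rhs.T).T
    return A, B

# Toy problem: 6 miRNAs, 4 diseases, rank-2 factors (all values hypothetical).
rng = np.random.default_rng(0)
Y_toy = (rng.random((6, 4)) < 0.3).astype(float)
SM_toy = np.eye(6)
SD_toy = np.eye(4)
A0 = rng.random((6, 2))
B0 = rng.random((4, 2))
A_fit, B_fit = cmf_als(Y_toy, SM_toy, SD_toy, A0, B0)
scores = A_fit @ B_fit.T   # predicted association scores
```

A fixed number of iterations is used here for simplicity; in practice one would stop when the change in the objective or in the factors falls below a tolerance. Solving the linear system with np.linalg.solve avoids forming an explicit matrix inverse.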

Table 1: We implemented CMFMDA to predict potential Esophageal Neoplasms-related miRNAs. As a result, 9 out of the top 10 and 42 out of the top 50 predicted Esophageal Neoplasms-related miRNAs were confirmed based on miR2Disease and dbDEMC (1st column: top 1-25; 2nd column: top 26-50).

Table 2: We implemented CMFMDA to prioritize candidate miRNAs for Kidney Neoplasms based on known associations in the HMDD database. As a result, 9 out of the top 10 and 41 out of the top 50 predicted Kidney Neoplasms-related miRNAs were confirmed based on miR2Disease and dbDEMC (1st column: top 1-25; 2nd column: top 26-50).