A Novel Model for Predicting Associations between Diseases and LncRNA-miRNA Pairs Based on a Newly Constructed Bipartite Network

Motivation. An increasing number of studies have demonstrated that many complex human diseases are associated not only with microRNAs (miRNAs) but also with long noncoding RNAs (lncRNAs). LncRNAs and miRNAs play significant roles in various biological processes. Therefore, developing effective computational models for predicting novel associations between diseases and lncRNA-miRNA pairs (LMPairs) will benefit not only the understanding of disease mechanisms at the lncRNA-miRNA level and the detection of disease biomarkers for diagnosis, treatment, prognosis, and prevention, but also the understanding of interactions between diseases and LMPairs at the disease level. Results. It is well known that genes with similar functions are often associated with similar diseases. In this article, a novel model named PADLMP is proposed for predicting associations between diseases and LMPairs. In this model, a Disease-LncRNA-miRNA (DLM) tripartite network was first designed by integrating the lncRNA-disease association network and the miRNA-disease association network; then the disease-LMPair bipartite association network was constructed based on the DLM network and the lncRNA-miRNA association network; finally, potential associations between diseases and LMPairs were predicted based on the newly constructed disease-LMPair network. Simulation results show that PADLMP achieves AUCs of 0.9318, 0.9090 ± 0.0264, and 0.8950 ± 0.0027 in the LOOCV, 2-fold, and 5-fold cross validation frameworks, respectively, which demonstrates the reliable prediction performance of PADLMP.


Introduction
MicroRNAs (miRNAs) are small endogenous noncoding RNA molecules that regulate gene expression at the posttranscriptional level by binding to the 3′ untranslated regions (UTRs) of target mRNAs, leading to translational inhibition or cleavage of the target mRNAs [1]. Moreover, long noncoding RNAs (lncRNAs), the largest class of noncoding RNAs, with lengths greater than 200 nt, can also regulate gene expression at different levels, including transcriptional, posttranscriptional, and epigenetic regulation.
Recently, an increasing number of studies have demonstrated that lncRNAs and miRNAs play a significant role in cell proliferation and cell differentiation [2][3][4][5] and that the interactions between lncRNAs and miRNAs may have consequences for diseases, explain disease processes, and present opportunities for new therapies [6]. For example, Dey et al. showed that lncRNA H19 gives rise to the microRNAs miR-675-3p and miR-675-5p to promote skeletal muscle differentiation and regeneration [7]. Yao et al. discovered that knockdown of lncRNA XIST could exert tumor-suppressive functions in human glioblastoma stem cells by upregulating miR-152 [8]. Wang et al. demonstrated that silencing of lncRNA MALAT1 by miR-101 and miR-217 inhibits proliferation, migration, and invasion of esophageal squamous cell carcinoma cells [9]. Zhang et al. reported that lncRNA ANRIL indicates a poor prognosis of gastric cancer and promotes tumor growth by epigenetically silencing miR-99a/miR-449a [10]. You et al. found that miR-449a inhibits cell growth in lung cancer and regulates lncRNA NEAT1 [11]. Emmrich et al. discovered that the lncRNAs MONC and MIR100HG act as oncogenes in AMKL blasts [12]. Leung et al. found that miR-222 and miR-221, which are upregulated by Ang II, are transcribed from a large transcript and that knockdown of Lnc-Ang362 decreases the expression of miR-221 and miR-222 and reduces cell proliferation [13]. Zhu et al. discovered that lncRNA H19 and the H19-derived miRNA-675 were significantly downregulated in the metastatic prostate cancer cell line M12 compared with the non-metastatic prostate epithelial cell line [14]. Hirata et al. found that lncRNA MALAT1 is associated with miR-205 and promotes aggressive renal cell carcinoma [15]. Zhao and Ren demonstrated that TUG1 knockdown is significantly associated with decreased cell proliferation and promotes apoptosis of breast cancer cells through the regulation of miR-9 [16].
Computational and Mathematical Methods in Medicine
A growing body of research indicates that lncRNA-miRNA interactions are associated with the development of complex diseases, but, as far as we know, no prediction model has yet been proposed for large-scale forecasting of the associations between diseases and LMPairs. However, some prediction models have been reported to infer the associations between diseases and miRNA-miRNA pairs [17][18][19][20][21]. Moreover, studies have shown that miRNA-miRNA pairs can work cooperatively to regulate an individual gene or a cohort of genes that participate in similar processes [18,22]. Inspired by these existing state-of-the-art methods and ideas for large-scale prediction of the associations between diseases and miRNA-miRNA pairs, and based on the reasonable assumption that functionally similar LMPairs tend to be associated with similar diseases, a new model named PADLMP is proposed in this paper to predict potential associations between diseases and LMPairs. To date, it is the first computational model used to predict disease-LMPair associations. PADLMP can predict novel disease-LMPair associations on a large scale by combining the known lncRNA-disease, miRNA-disease, and lncRNA-miRNA associations. To evaluate the prediction performance of the proposed model, the evaluation frameworks of leave-one-out cross validation (LOOCV), 2-fold cross validation, and 5-fold cross validation were adopted based on the known disease-LMPair associations. A series of comparison experiments were also implemented to evaluate the influence of the number of walks on prediction performance. As a result, PADLMP achieved its best performance when the number of walks was set to 2. Specifically, PADLMP achieved AUCs of 0.9318, 0.9090 ± 0.0264, and 0.8950 ± 0.0027 in the LOOCV, 2-fold, and 5-fold cross validation frameworks, respectively.
The prediction results show that the PADLMP model is feasible and effective in predicting disease-LMPair associations on a large scale by considering the topology information of the known disease-LMPair bipartite network.

LncRNA-Disease Associations.
Known lncRNA-disease associations were downloaded from databases including lncRNADisease [23], MNDR [24], and Lnc2Cancer [25]. After preprocessing (removal of duplicate associations), 2048 distinct experimentally confirmed lncRNA-disease associations involving 1126 lncRNAs and 356 diseases were finally obtained (see Supplementary Table 1). We then constructed an adjacency matrix $A_1$ of size 1126 × 356 as the information source.
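As an illustration of this preprocessing step, the following minimal Python sketch removes duplicate associations and builds a binary adjacency matrix from a list of (lncRNA, disease) records. The function name `build_adjacency`, the alphabetical row and column ordering, and the toy records are assumptions made for this example; they are not taken from the supplementary tables.

```python
def build_adjacency(associations):
    """Return (row_labels, col_labels, matrix) for a list of (row, col) pairs.

    Duplicate pairs are dropped first; labels are sorted alphabetically
    (an arbitrary but reproducible ordering chosen for this sketch).
    """
    unique = sorted(set(associations))
    rows = sorted({r for r, _ in unique})
    cols = sorted({c for _, c in unique})
    row_index = {name: i for i, name in enumerate(rows)}
    col_index = {name: j for j, name in enumerate(cols)}
    matrix = [[0] * len(cols) for _ in rows]
    for r, c in unique:
        matrix[row_index[r]][col_index[c]] = 1
    return rows, cols, matrix


# Hypothetical toy records; the third record duplicates the first.
toy_records = [
    ("H19", "prostate cancer"),
    ("MALAT1", "renal cell carcinoma"),
    ("H19", "prostate cancer"),
    ("XIST", "glioblastoma"),
]
toy_lncrnas, toy_diseases, toy_A1 = build_adjacency(toy_records)
```

With the real data, the same routine would yield a 1126 × 356 matrix whose entries sum to the number of distinct associations.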

LncRNA-miRNA Associations.
We downloaded two versions (2015 and 2017) of the lncRNA-miRNA association datasets from the starBase v2.0 database [31], which provides the most comprehensive experimentally confirmed lncRNA-miRNA interactions based on large-scale CLIP-Seq data. After preprocessing (elimination of duplicate values, erroneous data, and disorganized data), 20324 lncRNA-miRNA interactions involving 494 miRNAs and 1127 lncRNAs were finally obtained (see Supplementary Table 3).

Methods Overview.
In order to predict potential novel associations between diseases and LMPairs, a new model named PADLMP is proposed, which consists of three steps (Figure 1). First, the association networks are constructed and the data are integrated. Second, the similarities for lncRNAs, diseases, miRNAs, and lncRNA-miRNA pairs are calculated based on the association networks. Finally, potential associations between diseases and LMPairs are inferred.

Construct the Associated Network
3.2.1. LncRNA-Disease Network, Disease-miRNA Network, and LncRNA-miRNA Network. Based on the newly obtained known lncRNA-disease associations, we constructed the lncRNA-disease bipartite network $G_1 = (V_1, E_1)$ according to the following steps.
Step 1. Let $L_1$ be the set of the 1126 newly collected lncRNAs, let $D_1$ be the set of the 356 newly collected diseases, and let $V_1 = L_1 \cup D_1$; then we obtain the vertex set $V_1$ of $G_1$.
Step 2. $\forall l_i \in L_1$, if there is $d_j \in D_1$ such that the association between $l_i$ and $d_j$ belongs to the set of the 2048 newly collected lncRNA-disease associations, then we define that there is an edge between $l_i$ and $d_j$ in $G_1$; in this way, we obtain the edge set $E_1$ of $G_1$. Obviously, $E_1$ is composed of these 2048 newly collected lncRNA-disease associations.
Similar to $G_1$, we constructed the disease-miRNA bipartite network $G_2 = (V_2, E_2)$ according to the following steps.

Step 1. Let $M_1$ be the set of all the newly collected miRNAs, let $D_2$ be the set of all the newly collected diseases, and let $V_2 = M_1 \cup D_2$; then we obtain the vertex set $V_2$ of the disease-miRNA network $G_2$.
Step 2. $\forall m_i \in M_1$, if there is $d_j \in D_2$ such that the association between $m_i$ and $d_j$ belongs to the set of all the newly collected disease-miRNA associations, then we define that there is an edge between $m_i$ and $d_j$ in $G_2$; in this way, we obtain the edge set $E_2$ of $G_2$. Obviously, $E_2$ is composed of all these newly collected disease-miRNA associations.
We also constructed the lncRNA-miRNA bipartite network $G_3 = (V_3, E_3)$ according to the following steps.
Step 1. Let $L_2$ be the set of the 1127 newly collected lncRNAs, let $M_2$ be the set of the 494 newly collected miRNAs, and let $V_3 = L_2 \cup M_2$; then we obtain the vertex set $V_3$ of $G_3$.
Step 2. $\forall l_i \in L_2$, if there is $m_j \in M_2$ such that the association between $l_i$ and $m_j$ belongs to the set of the 20324 newly collected lncRNA-miRNA associations, then we define that there is an edge between $l_i$ and $m_j$ in $G_3$; in this way, we obtain the edge set $E_3$ of $G_3$. Obviously, $E_3$ is composed of these 20324 newly collected lncRNA-miRNA associations.

3.2.2. Disease-LncRNA-miRNA Network. Based on the bipartite networks $G_1$ (lncRNA-disease), $G_2$ (disease-miRNA), and $G_3$ (lncRNA-miRNA) constructed above, with edge sets $E_1$, $E_2$, and $E_3$, respectively, we constructed a new tripartite network $G_4 = (V_4, E_4)$ according to the following steps.
Step 1. Let $D_3$ denote the set of diseases of $G_4$. For a disease $d_k \in D_3$, an lncRNA $l_i$, and a miRNA $m_j$, suppose that the association between $l_i$ and $d_k$ belongs to $E_1$, the association between $m_j$ and $d_k$ belongs to $E_2$, and the association between $l_i$ and $m_j$ belongs to $E_3$ simultaneously. Then we define that there are an edge between $d_k$ and $l_i$, an edge between $d_k$ and $m_j$, and an edge between $l_i$ and $m_j$ in $G_4$; in this way, we obtain the edge set $E_4$ of $G_4$.
Step 2. Let $L$ be the set of lncRNAs such that $\forall l_i \in L$ there is $d_k \in D_3$ for which the association between $l_i$ and $d_k$ belongs to $E_4$. Let $M$ be the set of miRNAs such that $\forall m_j \in M$ there is $d_k \in D_3$ for which the association between $m_j$ and $d_k$ belongs to $E_4$. Let $V_4 = L \cup M \cup D_3$; then we obtain the vertex set $V_4$ of $G_4$.

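3.2.3. Disease-LMPairs Network. Based on the tripartite Disease-LncRNA-miRNA network $G_4$ obtained above, whose vertex set consists of the lncRNA set $L$, the miRNA set $M$, and the disease set $D_3$, we constructed a new bipartite disease-LMPairs network $G = (V, E)$ according to the following steps.
Step 1. $\forall l_i \in L$ and $m_j \in M$, let $p = (l_i, m_j)$ denote an lncRNA-miRNA pair and let $P$ denote the set of such pairs; in this way, we obtain the vertex set $V$ of $G$.
Step 2. $\forall d_k \in D_3$, if there is $p = (l_i, m_j) \in P$ such that the association between $l_i$ and $d_k$ belongs to $E_1$ (the lncRNA-disease edge set), the association between $m_j$ and $d_k$ belongs to $E_2$ (the disease-miRNA edge set), and the association between $l_i$ and $m_j$ belongs to $E_3$ (the lncRNA-miRNA edge set) simultaneously, then we define that there is an edge between $d_k$ and $p$ in $G$; in this way, we obtain the edge set $E$ of $G$.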
To make it easier to understand the construction of the network, we list in "The Meaning of Vertex and Edges in the Networks" each of the vertices, edges, and their meanings that appear in Sections 3.2.1, 3.2.2, and 3.2.3.
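The construction in Sections 3.2.2 and 3.2.3 can be sketched in a few lines of Python: a disease is linked to an lncRNA-miRNA pair whenever the disease, the lncRNA, and the miRNA form a triangle across the three association sets. The function name `disease_lmpair_edges` and the toy identifiers (`l1`, `m1`, `d1`, and so on) are hypothetical and serve only as an illustration.

```python
def disease_lmpair_edges(lnc_dis, mir_dis, lnc_mir):
    """Return the set of (disease, (lncRNA, miRNA)) edges.

    lnc_dis: iterable of (lncRNA, disease) associations
    mir_dis: iterable of (miRNA, disease) associations
    lnc_mir: iterable of (lncRNA, miRNA) associations
    """
    lnc_dis, mir_dis = set(lnc_dis), set(mir_dis)
    edges = set()
    for lnc, mir in set(lnc_mir):
        diseases_of_lnc = {dis for name, dis in lnc_dis if name == lnc}
        diseases_of_mir = {dis for name, dis in mir_dis if name == mir}
        # A disease shared by both partners closes the triangle.
        for disease in diseases_of_lnc & diseases_of_mir:
            edges.add((disease, (lnc, mir)))
    return edges


# Hypothetical toy associations.
toy_lnc_dis = [("l1", "d1"), ("l2", "d1"), ("l2", "d2")]
toy_mir_dis = [("m1", "d1"), ("m2", "d2")]
toy_lnc_mir = [("l1", "m1"), ("l2", "m2"), ("l1", "m2")]
toy_edges = disease_lmpair_edges(toy_lnc_dis, toy_mir_dis, toy_lnc_mir)
```

In the toy data, the pair (`l1`, `m2`) interacts but shares no disease, so it contributes no edge to the bipartite network.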

Calculation of the Disease Semantic Similarity (Dis-SemSim).
Firstly, we downloaded MeSH descriptors from the National Library of Medicine and curated the names of diseases using the standard MeSH disease terms. Next, we represented the relationships among diseases by a directed acyclic graph (DAG), $\mathrm{DAG}(d) = (T(d), E(d))$. Here, $T(d)$ represents the node set including node $d$ and its ancestor nodes, and $E(d)$ denotes the edge set of direct links from a parent node to a child node, which represent the relationships between different diseases [32]. Then, based on the disease DAG, the contribution of an ancestor node $t$ to the semantic value of disease $d$ and the semantic value of disease $d$ itself can be calculated by the following two equations, respectively:

$$D_d(t) = \begin{cases} 1, & t = d, \\ \max\{\Delta \cdot D_d(t') \mid t' \in \text{children of } t\}, & t \neq d, \end{cases} \qquad (1)$$

$$\mathrm{DV}(d) = \sum_{t \in T(d)} D_d(t), \qquad (2)$$

where $D_d(t)$ represents the contribution of an ancestor node $t$ to the semantic value of disease $d$, $\mathrm{DV}(d)$ represents the semantic value of disease $d$ itself, and $\Delta$ is the semantic contribution decay factor with a value between 0 and 1. The role of $\Delta$ is to guarantee that, as the distance between disease $d$ and its ancestor disease $t$ increases, the contribution of $t$ to $d$ progressively decreases. Moreover, from formula (1), it is easy to see that it is also reasonable to define the contribution of $d$ to itself as 1. In addition, according to the experimental results of previous state-of-the-art methods [33,34], we set $\Delta$ to 0.5 in this paper. Based on the assumption that two diseases sharing more common ancestor nodes in the DAG have higher semantic similarity, we define the semantic similarity between two diseases $d_i$ and $d_j$ as follows:

$$\mathrm{DisSemSim}(d_i, d_j) = \frac{\sum_{t \in T(d_i) \cap T(d_j)} \left( D_{d_i}(t) + D_{d_j}(t) \right)}{\mathrm{DV}(d_i) + \mathrm{DV}(d_j)}, \qquad (3)$$

where $T(d_i)$ and $T(d_j)$ represent the node sets of the DAGs of $d_i$ and $d_j$, respectively.
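The DAG-based semantic similarity described above can be illustrated with the following Python sketch, which propagates contributions from a disease to its ancestors with a decay factor of 0.5 and then compares two diseases through their shared ancestors. The function names, the `parents` dictionary layout, and the three-term toy hierarchy are assumptions made for this example; they are not the MeSH data used in the paper.

```python
def semantic_contributions(disease, parents, delta=0.5):
    """Map each node of the disease's DAG to its semantic contribution.

    parents: dict mapping a term to the list of its direct parent terms.
    The disease contributes 1 to itself; each step towards an ancestor
    multiplies the contribution by delta, keeping the maximum over paths.
    """
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, ()):
            value = delta * contrib[node]
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib


def semantic_similarity(d1, d2, parents, delta=0.5):
    """Shared-ancestor contributions divided by the two semantic values."""
    c1 = semantic_contributions(d1, parents, delta)
    c2 = semantic_contributions(d2, parents, delta)
    shared = set(c1) & set(c2)
    numerator = sum(c1[t] + c2[t] for t in shared)
    return numerator / (sum(c1.values()) + sum(c2.values()))


# Hypothetical toy hierarchy: two siblings under a common parent.
toy_parents = {
    "colon cancer": ["intestinal neoplasm"],
    "rectal cancer": ["intestinal neoplasm"],
    "intestinal neoplasm": ["neoplasm"],
}
toy_sim = semantic_similarity("colon cancer", "rectal cancer", toy_parents)
```

In the toy hierarchy each disease has semantic value 1 + 0.5 + 0.25 = 1.75, and the two shared ancestors contribute 1.5 in total, so the similarity is 1.5 / 3.5.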

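Calculation of the Gaussian Interaction Profile Kernel Similarity for Diseases (GIPSim).
According to the assumption that functionally similar genes tend to be associated with similar diseases, we can integrate the topological information of the known miRNA-disease association network and the lncRNA-disease association network to measure disease similarity. In this section, we adopt the Gaussian interaction profile kernel to calculate the similarity of diseases. Firstly, based on the lncRNA-disease network $G_1$ and the disease-miRNA network $G_2$ constructed above, we obtain two adjacency matrices $A_1$ and $A_2$ as follows. For any given lncRNA $l_i$ (or miRNA $m_i$) and disease $d_j$, with $k$ taking the value 1 or 2, we define

$$A_k(i, j) = \begin{cases} 1, & \text{if } l_i \text{ (or } m_i \text{) is associated with } d_j \text{ in } G_k, \\ 0, & \text{otherwise.} \end{cases} \qquad (4)$$

Hence, letting $IP_k(d_j)$ denote the $j$th column of matrix $A_k$, we can calculate the Gaussian kernel similarity between the diseases $d_i$ and $d_j$ based on their interaction profiles as follows:

$$\mathrm{GIP}_1^{(k)}(d_i, d_j) = \exp\left(-\gamma_k \left\| IP_k(d_i) - IP_k(d_j) \right\|^2\right), \qquad \gamma_k = 1 \Big/ \left( \frac{1}{n_k} \sum_{t=1}^{n_k} \left\| IP_k(d_t) \right\|^2 \right), \qquad (5)$$

where the parameter $n_k$ denotes the number of diseases in $G_k$ ($k = 1, 2$).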
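Based on formula (5), we can adopt the squared root approach to calculate the Gaussian interaction profile kernel similarity for diseases as follows:

$$\mathrm{GIPSim}(d_i, d_j) = \sqrt{\mathrm{GIP}_1^{(1)}(d_i, d_j) \cdot \mathrm{GIP}_1^{(2)}(d_i, d_j)}, \qquad (6)$$

where $\mathrm{GIP}_1^{(k)}$ denotes the disease kernel of formula (5) computed from the adjacency matrix $A_k$ ($k = 1, 2$).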
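A minimal Python sketch of the Gaussian interaction profile kernel follows. It assumes the commonly used bandwidth normalization (the reciprocal of the mean squared norm of the profiles) and reads the squared root approach as the geometric mean of the two kernels; both choices, the function names, and the requirement that the two matrices list the same diseases in the same column order are assumptions of this example.

```python
import math


def gip_kernel(profiles):
    """Gaussian kernel between interaction profiles (equal-length 0/1 lists).

    The bandwidth is 1 / (mean squared norm of the profiles), which assumes
    that at least one profile is nonzero.
    """
    n = len(profiles)
    mean_sq_norm = sum(sum(x * x for x in p) for p in profiles) / n
    gamma = 1.0 / mean_sq_norm
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            kernel[i][j] = math.exp(-gamma * sq_dist)
    return kernel


def columns(matrix):
    """Return the columns of a row-major matrix as lists (disease profiles)."""
    return [list(col) for col in zip(*matrix)]


def disease_gip_sim(a1, a2):
    """Geometric mean of the disease kernels derived from a1 and a2."""
    k1 = gip_kernel(columns(a1))
    k2 = gip_kernel(columns(a2))
    return [[math.sqrt(x * y) for x, y in zip(r1, r2)] for r1, r2 in zip(k1, k2)]


# Hypothetical toy matrices: 2 lncRNAs x 2 diseases and 1 miRNA x 2 diseases.
toy_a1 = [[1, 0], [1, 1]]
toy_a2 = [[1, 0]]
toy_gip = disease_gip_sim(toy_a1, toy_a2)
```

Passing the rows of a matrix instead of its columns to `gip_kernel` gives the analogous kernel for lncRNAs or miRNAs.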

Calculation of the Integrated Similarity between Diseases.
Based on the formulas presented above, we can finally define the similarity measurement between diseases $d_i$ and $d_j$, which integrates the disease semantic similarity (Dis-SemSim) and the Gaussian interaction profile kernel similarity (GIPSim).

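Calculation of the Gaussian Interaction Profile Kernel Similarity for lncRNAs (miRNAs).
For any two given lncRNAs (miRNAs) $l_i$ ($m_i$) and $l_j$ ($m_j$), in a similar way to the calculation of the disease kernel $\mathrm{GIP}_1$, the kernel $\mathrm{GIP}_2$ can be obtained as follows ($k = 1, 2$):

$$\mathrm{GIP}_2^{(k)}(v_i, v_j) = \exp\left(-\gamma'_k \left\| IP_k(v_i) - IP_k(v_j) \right\|^2\right), \qquad \gamma'_k = 1 \Big/ \left( \frac{1}{n} \sum_{t=1}^{n} \left\| IP_k(v_t) \right\|^2 \right),$$

where $IP_k(v_i)$ and $IP_k(v_j)$ are the $i$th row and the $j$th row in matrix $A_k$, respectively, and $n$ is the number of lncRNAs (miRNAs) in $G_k$.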

Calculation of the Integrated Similarity between lncRNAs (miRNAs).
Based on the formulas presented above, we can finally define the similarity measurement between lncRNAs $l_i$ and $l_j$ and, in the same way, between miRNAs $m_i$ and $m_j$.
3.5. Similarity for LncRNA-miRNA Pairs (LMPairSim).
Based on the bipartite disease-LMPairs network constructed above, for any two given lncRNA-miRNA pairs $p_i = (l_i, m_i)$ and $p_j = (l_j, m_j)$, we can calculate the similarity between them in three different ways, given by formulas (12), (13), and (14), respectively; the second is the Squared Root Approach and the third is the Centre Distance Approach.

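Prediction of Potential Associations between Diseases and LMPairs.
Inspired by the KATZ method in social networks [35], disease-gene correlation prediction [36], and lncRNA-disease association prediction [37], we developed PADLMP, a new computational model for predicting disease-LMPair associations (see Figure 1). Based on formulas (12), (13), (14), and (15), the LMPair similarity matrix PairSim can be obtained, and the known disease-LMPair associations can be represented by an adjacency matrix DP:

$$DP(i, j) = \begin{cases} 1, & \text{if } d_i \text{ is associated with } p_j, \\ 0, & \text{otherwise,} \end{cases} \qquad (16)$$

where $d_i$ denotes the $i$th disease in the disease set $D$ and $p_j$ denotes the $j$th LMPair in the LMPair set $P$. Hence, inspired by the approaches of KATZHMDA [38] and KATZ [35], we can construct an integrated matrix $DP^*$ for further predicting the potential associations between diseases and LMPairs as follows:

$$DP^* = \begin{bmatrix} \mathrm{DisSim} & DP \\ DP^T & \mathrm{PairSim} \end{bmatrix}. \qquad (17)$$

Based on the integrated matrix $DP^*$ constructed above and letting $P = \{p_1, p_2, \ldots, p_m\}$, then, for any given lncRNA-miRNA pair $p_j \in P$ and disease node $d_i \in D$, the probability of potential association between $d_i$ and $p_j$ can be obtained as follows:

$$S = \sum_{l=1}^{k} \beta^l \left( DP^* \right)^l, \qquad (18)$$

where the parameter $k$ is an integer bigger than 1 and the parameter $\beta$ satisfies $0 < \beta < 1$.
Additionally, according to formula (18), the $(n + m) \times (n + m)$ dimensional matrix $S$, where $n$ is the number of diseases and $m$ is the number of LMPairs, depicts the possibilities of all associations between diseases and LMPairs, and it can be further written in the following form:

$$S = \begin{bmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{bmatrix}. \qquad (19)$$

From formula (19), it is easy to see that $S_{12}$ is exactly the final prediction result matrix, which includes all of the potential associations between diseases and LMPairs. In addition, considering that long walks in a sparse network may be less meaningful and may disrupt association prediction, we set $k$ to 2, 3, and 4 here. Then, the final prediction result matrix can be represented by the matrices DP, PairSim, and DisSim based on equation (19). When $k = 2$, there is

$$S_{12} = \beta \cdot DP + \beta^2 \left( \mathrm{DisSim} \cdot DP + DP \cdot \mathrm{PairSim} \right). \qquad (20)$$

When $k = 3$, there is

$$S_{12} = \beta \cdot DP + \beta^2 \left( \mathrm{DisSim} \cdot DP + DP \cdot \mathrm{PairSim} \right) + \beta^3 \left( DP \cdot DP^T \cdot DP + \mathrm{DisSim}^2 \cdot DP + \mathrm{DisSim} \cdot DP \cdot \mathrm{PairSim} + DP \cdot \mathrm{PairSim}^2 \right). \qquad (21)$$

When $k = 4$, there is

$$\begin{aligned} S_{12} = {} & \beta \cdot DP + \beta^2 \left( \mathrm{DisSim} \cdot DP + DP \cdot \mathrm{PairSim} \right) \\ & + \beta^3 \left( DP \cdot DP^T \cdot DP + \mathrm{DisSim}^2 \cdot DP + \mathrm{DisSim} \cdot DP \cdot \mathrm{PairSim} + DP \cdot \mathrm{PairSim}^2 \right) \\ & + \beta^4 \left( \mathrm{DisSim}^3 \cdot DP + DP \cdot DP^T \cdot \mathrm{DisSim} \cdot DP + \mathrm{DisSim} \cdot DP \cdot DP^T \cdot DP + DP \cdot \mathrm{PairSim} \cdot DP^T \cdot DP \right) \\ & + \beta^4 \left( DP \cdot DP^T \cdot DP \cdot \mathrm{PairSim} + \mathrm{DisSim}^2 \cdot DP \cdot \mathrm{PairSim} + \mathrm{DisSim} \cdot DP \cdot \mathrm{PairSim}^2 + DP \cdot \mathrm{PairSim}^3 \right). \end{aligned} \qquad (22)$$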
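A truncated KATZ score of this kind can be sketched in plain Python as follows. The sketch assumes a KATZHMDA-style block layout for the integrated matrix (disease similarity and LMPair similarity on the diagonal blocks, the association matrix and its transpose off the diagonal) and sums the weighted matrix powers up to a maximum walk length; the function names and the one-disease, one-pair toy input are hypothetical.

```python
def matmul(x, y):
    """Multiply two row-major matrices given as lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def katz_scores(dis_sim, dp, pair_sim, k=2, beta=0.01):
    """Return the disease x LMPair block of sum_{l=1..k} beta^l * M^l.

    M is the assumed integrated matrix [[dis_sim, dp], [dp^T, pair_sim]].
    """
    n, m = len(dp), len(dp[0])
    top = [list(dis_sim[i]) + list(dp[i]) for i in range(n)]
    bottom = [[dp[i][j] for i in range(n)] + list(pair_sim[j]) for j in range(m)]
    integrated = top + bottom
    size = n + m
    total = [[0.0] * size for _ in range(size)]
    power = [[float(i == j) for j in range(size)] for i in range(size)]
    for length in range(1, k + 1):
        power = matmul(power, integrated)  # power now holds M^length
        weight = beta ** length
        for i in range(size):
            for j in range(size):
                total[i][j] += weight * power[i][j]
    # Upper-right block: rows are diseases, columns are LMPairs.
    return [row[n:] for row in total[:n]]


# Hypothetical toy input with one disease and one LMPair.
toy_scores_k2 = katz_scores([[1.0]], [[1]], [[1.0]], k=2, beta=0.5)
toy_scores_k3 = katz_scores([[1.0]], [[1]], [[1.0]], k=3, beta=0.5)
```

For realistic matrix sizes a numerical library would be preferable to the nested loops used here; the pure-Python form is kept only so that the example has no dependencies.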

Results
In order to estimate the prediction performance of our newly proposed model PADLMP, the leave-one-out cross validation (LOOCV) procedure was adopted based on the positive samples of disease-LMPair associations. In the LOOCV framework, each known disease-LMPair association is used in turn as the test sample, and the remaining known disease-LMPair associations are used as training samples for model learning. In particular, all disease-LMPairs without known association evidence are considered as candidate samples. In the LOOCV, we obtain the rank of each left-out test sample relative to the candidate samples, and a test sample ranked above a given threshold is considered to be successfully predicted. The corresponding true positive rates (TPR, sensitivity) and false positive rates (FPR, 1 − specificity) can be obtained by setting different thresholds. Here, sensitivity is the percentage of test samples ranked above the given threshold, and specificity is the percentage of negative samples ranked below the threshold. The receiver operating characteristic (ROC) curve can be drawn by plotting TPR versus FPR at different thresholds. In order to evaluate the predictive performance of PADLMP, the area under the ROC curve (AUC) was further calculated. An AUC value of 1 indicates perfect prediction, while an AUC value of 0.5 indicates purely random performance. From the above, we can see that several parameters, such as the number of walks $k$ and the parameter $\beta$, are adopted in our prediction model PADLMP, and these parameters are critical to its prediction performance. Moreover, in Section 3.5, three different ways have been proposed to calculate the similarity for lncRNA-miRNA pairs (LMPairSim), so we also need to evaluate the performance of these three ways.
Hence, in this section, based on the LOOCV framework, we implemented a series of comparison experiments to evaluate the influence of these parameters; the simulation results are shown in Figure 2. From Figure 2, it is easy to see that PADLMP achieves the best prediction performance when $k$ is set to 2. As for the parameter $\beta$, we set it to 0.01 during simulations based on the empirical values given by previous state-of-the-art works [37][39][40][41]. Moreover, in the LOOCV, for the similarity calculation of LMPairSim, we use formulas (12), (13), and (14) in turn and then select the formula that obtains the maximum AUC value. As a result, AUC values of 0.9318, 0.9262, and 0.9247 were obtained when selecting formulas (12), (14), and (13), respectively.
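The LOOCV procedure can be sketched as follows. For each known association, the sketch hides it, recomputes the scores, and records the fraction of candidate samples that score below the hidden sample; averaging these fractions gives an AUC-style summary that is equivalent to sweeping a rank threshold. The function names, the tie handling (a tie counts as one half), and the simple `neighbor_counts` scorer standing in for the prediction model are assumptions of this example.

```python
def loocv_auc(dp, predict):
    """Leave-one-out AUC for a 0/1 association matrix.

    dp: list of lists with 1 for known associations and 0 for candidates.
    predict: function taking a training matrix and returning a score matrix
             of the same shape.
    """
    known = [(i, j) for i, row in enumerate(dp) for j, v in enumerate(row) if v == 1]
    candidates = [(i, j) for i, row in enumerate(dp) for j, v in enumerate(row) if v == 0]
    fractions = []
    for i, j in known:
        train = [row[:] for row in dp]
        train[i][j] = 0  # hide the test association
        scores = predict(train)
        test_score = scores[i][j]
        below = 0.0
        for a, b in candidates:
            if scores[a][b] < test_score:
                below += 1.0
            elif scores[a][b] == test_score:
                below += 0.5
        fractions.append(below / len(candidates))
    return sum(fractions) / len(fractions)


def neighbor_counts(train):
    """Toy scorer: row degree plus column degree of each cell."""
    row_sums = [sum(row) for row in train]
    col_sums = [sum(col) for col in zip(*train)]
    return [[r + c for c in col_sums] for r in row_sums]


# Hypothetical toy association matrix (3 diseases x 3 LMPairs).
toy_dp = [[1, 1, 0], [0, 1, 1], [1, 0, 0]]
toy_auc = loocv_auc(toy_dp, neighbor_counts)
```

Any scoring function with the same signature, for instance one built on the KATZ measure, can be passed as `predict`.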
Furthermore, we also compared the performance of PADLMP with that of RLSMDA [42], WBSMDA [39], and LRLSLDA [41] in LOOCV, since none of PADLMP, RLSMDA, WBSMDA, and LRLSLDA requires negative samples. The simulation results are shown in Figure 3. It is easy to see that PADLMP achieves a reliable AUC of 0.9318, which is higher than the AUCs of 0.8104 and 0.9281 achieved by RLSMDA and WBSMDA, respectively. In addition, the AUC value initially obtained for LRLSLDA is less than 0.5, which is obviously unreasonable. So, based on prior knowledge [43], we subtract this value from 1 and obtain an AUC of 0.5254 for LRLSLDA.
Moreover, in order to further evaluate the prediction performance of PADLMP, $k$-fold cross validation was also implemented, in which all the known disease-LMPair association samples were randomly divided into $k$ equal parts; $k - 1$ parts were used as training samples for model learning, while the remaining part was used as testing samples for model evaluation. Specifically, considering time complexity and costs, we implemented only 2-fold and 5-fold cross validation to evaluate the prediction performance of PADLMP. As in LOOCV, all the disease-LMPairs without known association evidence were considered as candidate samples in the $k$-fold cross validation. Next, to avoid the performance bias caused by the random division of the testing samples, we repeated the random division and our simulations 100 times, and the corresponding ROC curves and AUCs were obtained in a similar way to that of LOOCV. Simulation results are shown in Table 1. From Table 1, it is easy to see that PADLMP achieves the best prediction performance, with average AUCs of 0.9090 and 0.8950 and standard deviations (STD) of 0.0264 and 0.0027 in the 2-fold and 5-fold cross validation, respectively, when the number of walks is set to 2.
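The repeated random division can be illustrated with the short Python sketch below, which shuffles the known associations, splits them into folds, and summarizes the per-repetition AUCs by their mean and standard deviation. The function names, the fixed seed, the use of the population standard deviation, and the `evaluate` callback (which is expected to train on one part and return an AUC for the other) are assumptions of this example.

```python
import random
import statistics


def kfold_splits(samples, k, rng):
    """Shuffle the samples and return a list of (train, test) pairs."""
    shuffled = list(samples)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        splits.append((train, folds[i]))
    return splits


def repeated_cv(samples, k, evaluate, repeats=100, seed=0):
    """Mean and standard deviation of the AUC over repeated k-fold runs.

    evaluate: function (train, test) -> AUC for one fold.
    """
    rng = random.Random(seed)
    aucs = []
    for _ in range(repeats):
        fold_aucs = [evaluate(train, test) for train, test in kfold_splits(samples, k, rng)]
        aucs.append(statistics.mean(fold_aucs))
    return statistics.mean(aucs), statistics.pstdev(aucs)


# Hypothetical usage with ten sample identifiers and a constant evaluator.
toy_splits = kfold_splits(range(10), 5, random.Random(0))
toy_mean, toy_std = repeated_cv(range(10), 5, lambda train, test: 0.9, repeats=3)
```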
From the above descriptions, it is obvious that the newly proposed model PADLMP achieves a reliable and effective prediction performance in both LOOCV and $k$-fold cross validation. Therefore, we publicly released the potential disease-LMPair associations with higher predicted relevance scores (see Supplementary Table 4) and anticipate that these disease-LMPair associations may offer valuable information and clues for corresponding biological experiments and will be confirmed by experimental observations in the future.

Case Studies
Colon cancer is a malignant tumor that is usually found at the border of the rectum and the sigmoid colon [44]. It is the third most common cancer and the third leading cause of cancer death in men and women in the United States [45]. However, patients with early colon tumors suffer from only subtle symptoms [46], which makes the disease difficult to detect. Worse still, its incidence is reported to have shown an upward trend in recent years [47]. Therefore, there is an urgent need to predict potential miRNAs and lncRNAs associated with colon tumors. With the help of modern medicine, many miRNAs have been shown to be associated with colon tumors. For example, miRNA-145 targets the insulin receptor substrate-1 and thus inhibits the growth of colon cancer cells [48].
Moreover, as the second largest cause of cancer deaths in women, breast cancer accounts for 22% of all cancers in women [49,50]. Breast cancer is caused by a variety of molecular changes and is traditionally diagnosed by histopathological features such as tumor size, grading, and lymph node status [49]. Studies have shown that lncRNAs and miRNAs are involved in breast cancer [51,52]. In order to better diagnose and treat breast cancer, it is necessary to predict breast cancer-related lncRNAs and miRNAs and to identify lncRNA and miRNA biomarkers [52]. In addition, prostate cancer is a malignant tumor derived from prostate epithelial cells [53]. Many factors, including age, family history of disease, and race, may increase the risk of prostate neoplasms [54]. So far, many miRNAs and lncRNAs, such as miRNA hsa-let-7a-5p and lncRNA XIST, have been found to be associated with prostate tumors.
As described previously, PADLMP has been demonstrated to achieve a reliable and effective prediction performance. Hence, in this section, case studies of the above three important cancers based on the top 5% of predicted results are implemented to show the prediction performance of PADLMP. As illustrated in Table 2, the prediction results have been verified based on recent updates in databases such as lncRNADisease, MNDR v2.0, starBase v2.0, HMDD, miR2Disease, and miRCancer.
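Selecting the top 5% of predicted results for one disease amounts to sorting the candidate LMPairs by score and keeping the leading share, as in the sketch below. The function name `top_percent`, the integer rounding rule (round down, but keep at least one candidate), and the toy scores are assumptions of this example.

```python
def top_percent(scores, percent=5):
    """Return the highest-scoring candidates for one disease.

    scores: dict mapping an (lncRNA, miRNA) pair to its predicted score.
    percent: share of candidates to keep, as an integer percentage.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = max(1, len(ranked) * percent // 100)
    return ranked[:keep]


# Hypothetical toy scores for 40 candidate pairs of a single disease.
toy_pair_scores = {("lnc%d" % i, "miR-x"): float(i) for i in range(40)}
toy_top = top_percent(toy_pair_scores)
```

The retained pairs can then be checked one by one against the association databases.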

Discussion and Conclusion
Accumulating evidence shows that lncRNA-miRNA interactions are involved in the development of many complex human diseases, such as breast cancer [16]; however, to our knowledge, no prediction model has been proposed for large-scale forecasting of the associations between diseases and LMPairs. Hence, based on the existing miRNA-disease associations, lncRNA-disease associations, and lncRNA-miRNA interactions, and on the assumption that genes with similar functions are often associated with similar diseases, we proposed a novel prediction model, PADLMP, to infer potential associations between diseases and LMPairs.
The main contributions of this paper are as follows: (1) we proposed the first computational model, PADLMP, for large-scale prediction of disease-LMPair associations, which can effectively predict potential associations between diseases and lncRNA-miRNA pairs. (2) We transformed the tripartite Disease-LncRNA-miRNA network into a bipartite disease-LMPair network, which greatly reduced the complexity of our prediction model. (3) Negative samples are not required in our prediction model. However, although PADLMP is a powerful tool for inferring novel associations between diseases and lncRNA-miRNA pairs, our method still has some limitations. Firstly, although we introduced semantic similarity for diseases and LMPairs, the calculation of the Gaussian interaction profile kernel similarity relies heavily on known disease-lncRNA associations, disease-miRNA associations, and disease-LMPair associations. Therefore, it causes an inevitable bias towards well-investigated diseases and LMPairs. Secondly, PADLMP cannot be applied to diseases and LMPairs that are poorly investigated and have no known associations. In the future, we will try to design new methods that do not rely on the topological information of the disease-LMPair association network to overcome these limitations.

Supplementary Materials
There are four supplementary tables in this manuscript. Supplementary Table 1 describes the lncRNA-disease associations; it includes 2048 lncRNA-disease associations, with the first column representing the lncRNAs and the second column representing the diseases. Supplementary Table 2 describes the miRNA-disease associations; it contains the IDs of the diseases, the IDs of the miRNAs, and the associations between diseases and miRNAs. Supplementary Table 3 describes the lncRNA-miRNA associations; it contains 20343 lncRNA-miRNA associations, with the first column representing the lncRNAs and the second column representing the miRNAs. Supplementary Table 4 describes the predicted associations between diseases and lncRNA-miRNA pairs obtained by PADLMP.