Recently, a growing number of laboratory studies have indicated that many long noncoding RNAs (lncRNAs) play important roles in various biological processes and are associated with many complex human diseases. Developing powerful computational models to predict associations between lncRNAs and diseases from heterogeneous biological datasets is therefore important. However, few approaches calculate and analyze lncRNA-disease associations on the basis of miRNA information. In this article, we develop a new computational method based on the distance correlation set to predict lncRNA-disease associations (DCSLDA). Compared with existing state-of-the-art methods, the major novelty of DCSLDA lies in the introduction of the lncRNA-miRNA-disease network and the distance correlation set; thus DCSLDA can be applied to predict potential lncRNA-disease associations without requiring any known disease-lncRNA associations. Simulation results show that DCSLDA significantly improves on previous models, achieving a reliable AUC of 0.8517 in leave-one-out cross-validation. Furthermore, when DCSLDA was used to prioritize candidate lncRNAs for three important cancers, 17 predicted associations in the top 0.5% of the prediction results were verified by independent studies and biological experiments. Hence, DCSLDA is anticipated to be a useful addition to the biomedical research field.
For a long time, RNA was considered merely transcriptional noise and an intermediary between a DNA sequence and its encoded protein [
Nowadays, with the advent of many biological datasets, such as LncRNADisease [
We downloaded known disease-miRNA associations from the Human MicroRNA Disease Database (HMDD) in July 2017 (see Supplementary file
We downloaded the known miRNA-lncRNA association dataset from the starBase v2.0 database in July 2017, which provided the most comprehensive experimentally confirmed lncRNA-miRNA interactions based on large-scale CLIP-seq data. After data preprocessing (including elimination of duplicate values, erroneous data, disorganized data, and so on),
To evaluate the performance of DCSLDA, the lncRNA-disease associations were downloaded from the LncRNADisease database, which integrated more than 1000 lncRNA-disease entries and 475 lncRNA interaction entries, including 321 lncRNAs and 221 diseases from ~500 publications. In this dataset, after duplicate associations and the lncRNA-disease associations involved in either diseases or lncRNAs which were not contained in the
To calculate the functional similarity between diseases, we borrowed a concept from social network analysis. In a social network, the similarity between any two nodes can be calculated by comparing and integrating the similarities of the nodes associated with them. In this section, based on the assumption that similar diseases tend to show a similar interaction and noninteraction pattern with the miRNAs, we calculated the disease similarity in the disease-miRNA interactive network. As illustrated in Figure
The flowchart of functional similarity calculation based on miRNA information comprises three steps: (1) constructing the known disease-miRNA and miRNA-lncRNA association networks, respectively; (2) obtaining the contribution of each miRNA; (3) calculating the functional similarity for diseases and lncRNAs, respectively.
Based on the assumption that similar lncRNAs tend to show a similar interaction and noninteraction pattern with the miRNAs, we can calculate the lncRNA similarity in the lncRNA-miRNA interactive network. Similar to the calculation procedure for disease functional similarity, we first constructed the lncRNA-miRNA interactive network from known lncRNA-miRNA associations (
Based on the assumptions that similar diseases tend to show a similar interaction and noninteraction pattern with the miRNAs and similar miRNAs tend to show a similar interaction and noninteraction pattern with the lncRNAs, we proposed a novel model, DCSLDA, based on miRNAs and distance correlation set to predict potential disease-lncRNA associations. As illustrated in Figure
The procedures of DCSLDA.
On the basis of the above descriptions and letting
We can construct a
where
Let
In
And thereafter, for any given node
Distance correlation set of D1 with r=2.
Based on (
Based on (
To evaluate the prediction performance of DCSLDA, we first implemented leave-one-out cross-validation (LOOCV) to compare DCSLDA with HGLDA [
The 17 lncRNA-disease pairs with high predicted values obtained when DCSLDA was applied to three important kinds of cancer (breast cancer, colorectal cancer, and lung cancer).
Cancer | LncRNA | PMID |
---|---|---|
Breast cancer | KCNQ1OT1 | 21304052; 26323944 |
Breast cancer | MALAT1 | 24525122; 19379481 |
Breast cancer | XIST | 27248326 |
Breast cancer | NEAT1 | 25417700; 28034643 |
Breast cancer | LINC00657 | 26942882 |
Breast cancer | SNHG16 | 28232182 |
Breast cancer | CASP8AP2 | 28388918 |
Breast cancer | PPP1R9B | 26387546 |
Breast cancer | TUG1 | 27791993 |
Colorectal cancer | KCNQ1OT1 | 16965397; 11340379 |
Colorectal cancer | MALAT1 | 25025966 |
Colorectal cancer | XIST | 17143621 |
Colorectal cancer | NEAT1 | 26552600 |
Colorectal cancer | SNHG16 | 26823726 |
Colorectal cancer | CASP8AP2 | 22216762 |
Lung cancer | MALAT1 | 20937273; 24757675; 24667321 |
Lung cancer | XIST | 27501756 |
Performance comparisons between DCSLDA and HGLDA based on the rankings of ten lncRNA-disease associations related to three important kinds of cancer (breast cancer, colorectal cancer, and lung cancer).
Cancer | LncRNA | DCSLDA rank | HGLDA rank |
---|---|---|---|
Breast cancer | KCNQ1OT1 | 1 | 8 |
Breast cancer | MALAT1 | 4 | 30 |
Breast cancer | XIST | 5 | 1 |
Breast cancer | NEAT1 | 8 | 12 |
Breast cancer | SNHG16 | 12 | 3 |
Colorectal cancer | KCNQ1OT1 | 1 | 5 |
Colorectal cancer | MALAT1 | 4 | 3 |
Colorectal cancer | XIST | 5 | 1 |
Lung cancer | MALAT1 | 4 | 9 |
Lung cancer | XIST | 5 | 1 |
Average rank | | 4.9 | 7.3 |
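
The average ranks in the last row of the table can be reproduced directly from the ten listed pairs. The short Python check below is only an arithmetic illustration; the names `RANKS` and `average_rank` are hypothetical and are not part of DCSLDA or HGLDA.

```python
# Ranks copied from the comparison table:
# (cancer, lncRNA, DCSLDA rank, HGLDA rank). A lower rank is better.
RANKS = [
    ("Breast cancer", "KCNQ1OT1", 1, 8),
    ("Breast cancer", "MALAT1", 4, 30),
    ("Breast cancer", "XIST", 5, 1),
    ("Breast cancer", "NEAT1", 8, 12),
    ("Breast cancer", "SNHG16", 12, 3),
    ("Colorectal cancer", "KCNQ1OT1", 1, 5),
    ("Colorectal cancer", "MALAT1", 4, 3),
    ("Colorectal cancer", "XIST", 5, 1),
    ("Lung cancer", "MALAT1", 4, 9),
    ("Lung cancer", "XIST", 5, 1),
]


def average_rank(rows, column):
    """Mean of one rank column (index 2 = DCSLDA, index 3 = HGLDA)."""
    values = [row[column] for row in rows]
    return sum(values) / len(values)


dcslda_avg = average_rank(RANKS, 2)
hglda_avg = average_rank(RANKS, 3)

# Number of pairs for which DCSLDA gives the strictly better (lower) rank.
dcslda_better = sum(1 for row in RANKS if row[2] < row[3])

print(f"DCSLDA: {dcslda_avg:.1f}, HGLDA: {hglda_avg:.1f}")  # DCSLDA: 4.9, HGLDA: 7.3
print(f"DCSLDA ranks higher in {dcslda_better} of {len(RANKS)} pairs")
```

The averages match the table. DCSLDA gives the better rank for 5 of the 10 pairs, and the largest single gap is MALAT1 in breast cancer (rank 4 versus rank 30).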
Using the lncRNA-disease association dataset downloaded from the LncRNADisease database, DCSLDA and HGLDA were each evaluated in the LOOCV framework. In LOOCV, each known lncRNA-disease association was left out in turn as the test sample, and we evaluated how well this association ranked relative to the candidate samples. Here, the candidate samples comprised all potential lncRNA-disease pairs without confirmed associations. After DCSLDA had been run, the rank of each left-out test sample relative to the candidate samples was obtained, and test samples ranked higher than a given threshold were considered successfully predicted. By varying the threshold, we obtained the corresponding true positive rates (TPR, sensitivity) and false positive rates (FPR, 1 - specificity). Here, sensitivity is the percentage of test samples ranked higher than the threshold, and specificity is the percentage of negative samples ranked lower than the threshold. The receiver operating characteristic (ROC) curve was then drawn by plotting TPR against FPR at different thresholds, and the area under the ROC curve (AUC) was calculated to evaluate the prediction performance of DCSLDA. An AUC of 1 represents perfect prediction, while an AUC of 0.5 indicates purely random performance.
The results of the performance comparison between DCSLDA and HGLDA are shown in Figure
Performance comparisons between DCSLDA and HGLDA in terms of ROC curve and AUC based on LOOCV.
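
The rank-to-ROC procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`loocv_ranks`, `roc_points`, `auc`) and the toy scores are hypothetical, every fold is assumed to have the same number of candidate samples, and ties are resolved in favour of the left-out sample.

```python
def loocv_ranks(test_scores, candidate_scores):
    """Rank of each left-out association among its fold's candidates (1 = best).

    test_scores[i] is the predicted score of the i-th left-out association;
    candidate_scores[i] holds the scores of the unconfirmed pairs in that fold.
    """
    ranks = []
    for score, candidates in zip(test_scores, candidate_scores):
        # Only strictly higher-scoring candidates push the test sample down.
        ranks.append(1 + sum(1 for c in candidates if c > score))
    return ranks


def roc_points(ranks, n_candidates):
    """(FPR, TPR) pairs for rank thresholds k = 0 .. n_candidates + 1."""
    n_folds = len(ranks)
    points = []
    for k in range(n_candidates + 2):
        # True positives: left-out samples ranked within the top k.
        tp = sum(1 for r in ranks if r <= k)
        # False positives: the remaining top-k slots are candidate samples.
        fp = sum(k - (1 if r <= k else 0) for r in ranks)
        points.append((fp / (n_folds * n_candidates), tp / n_folds))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy example: two left-out associations, three candidates per fold.
toy_ranks = loocv_ranks([0.9, 0.1], [[0.5, 0.4, 0.3], [0.5, 0.4, 0.3]])
toy_auc = auc(roc_points(toy_ranks, n_candidates=3))
print(toy_ranks, toy_auc)
```

In the toy example one association is ranked first and the other last, so the resulting AUC is 0.5 up to floating-point rounding; if every left-out association were ranked first, the same functions would return an AUC of 1.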
Cancer has become one of the most dangerous killers for human beings [
Performance evaluation of potential lncRNA-cancer association prediction in terms of ROC curve and AUC based on LOOCV.
From Figure
In formula (
Comparison of the effects of disease functional similarity and lncRNA functional similarity on the prediction performance of DCSLDA in the framework of LOOCV with
Obviously, DCSLDA can predict all potential relationships between diseases and lncRNAs in
Worldwide, breast cancer is the most prevalent cancer in women and a major public health problem. Several studies have focused on this disease, but more are needed, especially at the genetic and molecular levels [
Colorectal cancer (CRC) is a leading cause of cancer deaths worldwide. One of the fundamental processes driving the initiation and progression of CRC is the accumulation of a variety of genetic and epigenetic changes in colon epithelial cells. Colorectal cancer is usually caused by a combination of various factors, such as genetic and epigenetic changes [
Over the past 30 years, the morbidity and mortality of lung cancer have been increasing, and it has the highest incidence and mortality of any cancer worldwide [
In addition, performance comparisons between DCSLDA and HGLDA were implemented according to the rankings of these disease-related lncRNAs in the case studies of breast cancer, colorectal cancer, and lung cancer (see Table
In recent years, many studies have generated an enormous amount of biological data related to lncRNAs. Accumulating evidence shows that lncRNAs play a very important role in biological functions, and the prediction of lncRNA-disease associations is of great significance. However, there are few computational models for predicting potential disease-lncRNA associations based on miRNA information. To utilize the wealth of disease-miRNA, miRNA-lncRNA, and disease-lncRNA association data collected from three datasets and recently published in the experimental literature, we developed the novel model DCSLDA to predict potential disease-lncRNA associations. We first calculated the distance correlation set of each node based on the disease-miRNA-lncRNA interactive network and then integrated disease functional similarity and lncRNA functional similarity into DCSLDA. The important difference from previous computational models is that DCSLDA does not rely on any known disease-lncRNA associations; it predicts disease-lncRNA associations based only on the disease-miRNA-lncRNA interactive network. To evaluate the prediction performance of DCSLDA, LOOCV was implemented based on known disease-lncRNA and cancer-related lncRNA associations downloaded from the LncRNADisease database. Case studies were further carried out for three important cancers (breast cancer, colorectal cancer, and lung cancer) based on recently published experimental literature. The simulation results show that DCSLDA achieves reliable prediction performance and is superior to the state-of-the-art methods. Hence, it is anticipated that DCSLDA could play an important role in future biomedical research.
Disease functional similarity plays an important role in disease-related molecular function research. Functional associations between disease-related genes are often used to identify pairs of similar diseases from different perspectives. Calculating lncRNA functional similarity could benefit lncRNA function inference and disease-related lncRNA prioritization. Therefore, based on the two assumptions that (1) similar diseases tend to show a similar interaction and noninteraction pattern with the miRNAs and (2) similar lncRNAs tend to show a similar interaction and noninteraction pattern with the miRNAs, DCSLDA was developed to predict potential disease-related lncRNAs by integrating lncRNA functional similarity and disease functional similarity. Simulation results indicated that the prediction performance of DCSLDA is significantly improved by incorporating disease similarity and lncRNA similarity.
However, our method also has some limitations. Firstly, DCSLDA measures the correlations between lncRNAs and investigated diseases by integrating walks with different lengths in an lncRNA-miRNA-disease network, which is constructed by combining the known disease-miRNA network, miRNA-lncRNA network, and disease similarity network. The value of distance threshold parameters
The authors declare that there are no conflicts of interest regarding the publication of this paper.
The project is partly sponsored by the Natural Science Foundation of Hunan Province (No. 2018JJ4058, No. 2017JJ5036), the National Natural Science Foundation of China (No. 61640210, No. 61672447), and the CERNET Next Generation Internet Technology Innovation Project (No. NGII20160305).
Supplementary file 1: the known miRNA-disease associations for constructing the ASC1. We list 5430 known miRNA-disease associations which were collected from the HMDD dataset to construct the ASC1. Supplementary file 2: the known lncRNA-miRNA associations for constructing the ASC2. We list 10195 known lncRNA-miRNA associations which were collected from the starBase v2.0 database to construct the ASC2. Supplementary file 3: the known lncRNA-disease associations. We list 203 high-quality lncRNA-disease associations which were collected from the LncRNADisease database to validate the performance of our method. Supplementary file 4: the top 0.5% of the prediction results, listed to validate the performance of our method.