Predicting Long Noncoding RNA and Protein Interactions Using Heterogeneous Network Model

Recent studies show that long noncoding RNAs (lncRNAs) participate in diverse biological processes and complex diseases. However, the functions of most lncRNAs remain largely unknown. In this study, we propose a network-based computational method, called lncRNA-protein interaction prediction based on Heterogeneous Network Model (LPIHN), to predict potential lncRNA-protein interactions. First, we construct a heterogeneous network by integrating the lncRNA-lncRNA similarity network, the lncRNA-protein interaction network, and the protein-protein interaction (PPI) network. Then, a random walk with restart is implemented on the heterogeneous network to infer novel lncRNA-protein interactions. The leave-one-out cross validation test shows that our approach achieves an AUC value of 96.0%. Some lncRNA-protein interactions predicted by our method have been confirmed in recent research or databases, indicating the efficiency of LPIHN in predicting novel lncRNA-protein interactions.


Introduction
Long noncoding RNAs (lncRNAs), a class of important non-protein-coding transcripts longer than 200 nucleotides [1], have gained wide attention recently, and a large number of lncRNAs have been discovered by analysis of chromatin-state maps [2] and full-length complementary DNA (cDNA) [3] based on RNA-seq data [4]. Recent studies show that lncRNAs play critical roles in complex cellular processes, such as epigenetic regulation of gene expression [5][6][7][8][9], chromatin modification [10], and cell differentiation. Moreover, a number of lncRNAs are implicated in a range of human diseases [11][12][13]. Hence, uncovering the functions of lncRNAs is of great importance in understanding the mechanisms of biological processes.
Generally, almost all lncRNAs function through interactions with corresponding RNA binding proteins [14][15][16]. In turn, RNA binding proteins can interact with different lncRNAs to regulate diverse cellular processes [17,18]. Thus, identifying potential lncRNA-protein interactions is critical for understanding the functions of lncRNAs. Since experimental detection of unknown lncRNA-protein interactions is time consuming and costly, several computational approaches have been proposed for lncRNA-protein interaction prediction. In 2011, CatRAPID was developed by Bellucci et al. [5], in which lncRNA-protein pairs are encoded into feature vectors and scored by matrix computation. In the same year, Muppirala et al. [19] introduced RPIseq, which uses random forest (RF) and support vector machine (SVM) classifiers to predict lncRNA-protein interactions from the sequence information of lncRNAs and proteins alone. In 2013, Lu et al. [20] introduced lncPro, which predicts lncRNA-protein interactions by using scores derived from amino acid and nucleotide sequences together with Fisher's linear discriminant method.
In this paper, we introduce a network-based method, lncRNA-protein interaction prediction based on Heterogeneous Network Model (LPIHN), to predict the interactions between lncRNAs and proteins. First, we construct a heterogeneous network from protein-protein interactions (PPI), lncRNA expression similarity, and known lncRNA-protein interactions. Then, a random walk with restart is implemented on the heterogeneous network to infer novel lncRNA-protein interactions. We compare the performance with two network-based methods, PRIoritizatioN and Complex Elucidation (PRINCE) [21] and the random walk based method (RWR) [22]. In the leave-one-out cross validation (LOOCV) test, LPIHN outperforms PRINCE and RWR by a significant margin. Moreover, we identify several lncRNA-protein interactions that are supported by evidence in recent literature or databases, which shows the practical value of our method.

lncRNA-Protein Interactions.
The development of bioinformatics and experimental technologies has made the global lncRNA-protein interaction network available. NPInter (http://www.bioinfo.org/NPInter/) is an up-to-date database that collects experimentally validated interactions between noncoding RNAs (ncRNAs) and other biomolecules [23]. Shang et al. [24] extracted lncRNA-protein interactions from NPInter and carried out a detailed and comprehensive analysis of the lncRNA-protein network.
In this paper, we download the known ncRNA-protein interaction dataset from the NPInter 2.0 database (November 2013) and then filter the ncRNAs and their interacting proteins by restricting the organism and the type of ncRNA to "Homo sapiens" and "NONCODE," respectively. We further select the lncRNAs from these ncRNAs according to the human lncRNA dataset from the NONCODE 4.0 database [25] and map the lncRNA IDs and protein IDs to NONCODE 4.0 IDs and STRING IDs, respectively. $A$ is defined as the adjacency matrix of lncRNA-protein interactions, in which $A(i, j)$ is 1 if there is an interaction between protein $p_i$ and lncRNA $l_j$, otherwise 0.

lncRNA Expression Similarity.
The lncRNA expression profiles are obtained from the NONCODE 4.0 database, which includes the expression profiles of 89,369 lncRNAs in 24 human tissues or cell types. The Pearson correlation coefficient (PCC) [26][27][28][29][30][31] between the expression profiles of each pair of lncRNAs is then calculated as the lncRNA expression similarity. We define $x = \{x_1, x_2, \ldots, x_{24}\}$ and $y = \{y_1, y_2, \ldots, y_{24}\}$ as the expression profiles of lncRNAs $l_i$ and $l_j$, respectively, each containing the expression values in 24 human tissues or cell types. The expression similarity matrix of the lncRNAs, SL, can be calculated as

$$SL(i, j) = \left| \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y} \right|,$$

where $SL(i, j)$ in row $i$ and column $j$ represents the absolute value of the PCC between lncRNAs $l_i$ and $l_j$, $\mathrm{cov}(x, y)$ is the covariance of $x$ and $y$, and $\sigma_x$ and $\sigma_y$ are the standard deviations of $x$ and $y$, respectively. Calculating the PCC between the expression profiles of each pair of nodes is widely used in bioinformatics research. Hence, the similarity calculated from the expression data of lncRNAs can be expected to be reliable.
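The similarity computation described above can be illustrated with a minimal sketch. It assumes each expression profile is a plain list of numbers; the function names `abs_pearson` and `expression_similarity`, the toy profiles, and the choice to assign zero similarity to a constant profile are illustrative assumptions, not part of the original method description.

```python
import math


def abs_pearson(x, y):
    """Absolute Pearson correlation between two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant profile gets zero similarity
    return abs(cov / math.sqrt(var_x * var_y))


def expression_similarity(profiles):
    """Build the symmetric similarity matrix SL from a list of profiles."""
    n = len(profiles)
    sl = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            sl[i][j] = sl[j][i] = abs_pearson(profiles[i], profiles[j])
    return sl


# Toy profiles over four tissues (the paper uses 24); values are made up.
toy_profiles = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with the first profile
    [4.0, 3.0, 2.0, 1.0],  # perfectly anti-correlated with the first profile
]
SL = expression_similarity(toy_profiles)
```

Because the absolute value is taken, the anti-correlated pair receives the same similarity as the correlated pair.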

Protein-Protein Interactions.
We obtain PPI data from the STRING 9.1 database [32], which contains weighted protein interactions derived from computational prediction methods, high-throughput experiments, and text mining. We then remove the redundant PPI data, retaining 804 PPIs and their corresponding interaction scores among the proteins in the known lncRNA-protein dataset. The symmetric matrix SP is defined as the interaction matrix, in which $SP(i, j)$ is the interaction score of vertices $p_i$ and $p_j$. Formally, define a diagonal matrix $D$, in which $D(i, i)$ is the sum of row $i$ of SP; the normalization of SP is defined by the following function:

$$SP' = D^{-1/2} \, SP \, D^{-1/2},$$

where $SP'$ is the normalized form of SP.
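A minimal sketch of this normalization step follows. It assumes the symmetric form $D^{-1/2} \, SP \, D^{-1/2}$ that is commonly used in network propagation methods; this form, the function name `normalize_symmetric`, and the toy scores are assumptions made for illustration.

```python
import math


def normalize_symmetric(sp):
    """Return D^{-1/2} * SP * D^{-1/2}, with D(i, i) the i-th row sum of SP.

    Assumption: entries touching an isolated protein (row sum 0) stay 0.
    """
    row_sums = [sum(row) for row in sp]
    n = len(sp)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if row_sums[i] > 0 and row_sums[j] > 0:
                out[i][j] = sp[i][j] / math.sqrt(row_sums[i] * row_sums[j])
    return out


# Toy symmetric score matrix for three proteins; values are made up.
toy_sp = [
    [0.0, 0.9, 0.4],
    [0.9, 0.0, 0.0],
    [0.4, 0.0, 0.0],
]
toy_sp_norm = normalize_symmetric(toy_sp)
```

The normalized matrix remains symmetric, and an edge between two high-degree proteins is down-weighted relative to an edge with the same raw score between low-degree proteins.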

The Heterogeneous Network.
$G_1(L, E_1, SL)$ is defined as the lncRNA-lncRNA similarity network, in which $L = \{l_1, l_2, \ldots, l_n\}$ represents the set of lncRNAs and $E_1$ represents the set of edges between vertices; $l_i$ and $l_j$ are connected if the similarity $SL(i, j)$ calculated by PCC between $l_i$ and $l_j$ is more than 0. The PPI network $G_2(P, E_2, SP')$ is constructed analogously, where the vertex set $P = \{p_1, p_2, \ldots, p_m\}$ represents the set of proteins and $E_2$ represents the set of edges between proteins; $p_i$ and $p_j$ are connected if the normalized interaction score $SP'(i, j)$ between them is more than 0. In the lncRNA-protein network, $p_i$ and $l_j$ are connected if $A(i, j)$ is 1. The lncRNA-protein heterogeneous network is constructed by connecting the lncRNA-lncRNA similarity network and the PPI network through the lncRNA-protein interaction network (Figure 1(b)). Then, a random walk with restart is implemented on the heterogeneous network.

LPIHN Method.
LPIHN is proposed to score proteins for each lncRNA by implementing a random walk with restart on the heterogeneous network, based on the assumption that similar lncRNAs tend to exhibit similar interaction patterns with proteins. In a random walk with restart, a walker starts at a source node with an initial probability and iteratively transits to a randomly selected direct neighbor; at every time step, the walker can also restart at the source node with some probability. Hence, to implement the random walk on the heterogeneous network, the initial probability, transition matrix, and restart probability must be determined from the information supplied by the heterogeneous network. In the procedure of predicting the potential proteins for lncRNA $l_i$, let $p_0$ represent the initial probability of the walker starting at each node, where $l_i$ and the proteins that are known to interact with $l_i$ are assigned positive values and the remaining nodes are assigned zero. This assignment means that the random walker starts at $l_i$ or at one of the proteins that interact with $l_i$. Let $p_t$ represent the relevance of $l_i$ to all other nodes, in which the $j$th element indicates the probability that the random walker is found at node $j$ at step $t$. $p_{t+1}$ is given by the following iterative equation:

$$p_{t+1} = (1 - \gamma) \, W^{T} p_t + \gamma \, p_0,$$

where $\gamma \in (0, 1)$ represents the restart probability of the random walk, $W$ is the transition matrix, and $p_0$ is the initial probability of the random walk. All of them are detailed later.

Figure 1: A simple example of the procedure of predicting lncRNA-protein interactions with LPIHN. (a) The lncRNA-lncRNA similarity matrix is calculated by using the expression profiles of lncRNAs to calculate the PCC of each pair of lncRNAs. The profile of known lncRNA-protein interactions is obtained, where the entry for $l_i$ and $p_j$ is 1 if there exists an interaction between lncRNA $l_i$ and protein $p_j$, otherwise 0. The PPI profile is obtained based on the normalized score of PPI. (b) The upper purple network is the lncRNA-lncRNA similarity network, the lower red network is the PPI network, and both of them are constructed based on the corresponding profile in (a). The heterogeneous network is constructed by connecting the lncRNA-lncRNA similarity network and PPI network together with the known lncRNA-protein interaction network. Purple triangles indicate lncRNAs, red circles proteins, purple edges lncRNA-lncRNA similarities, red edges protein-protein interactions, and black dotted edges known lncRNA-protein interactions. (c) Our method assigns a score to each of the candidate proteins of a query lncRNA, with the random walk with restart implemented on the heterogeneous network. The candidate proteins are ranked based on the score.
Given a query lncRNA $l_i$, $l_i$ is the seed node in the lncRNA network: the probability of vertex $l_i$ is 1 and the other vertices in the lncRNA network are assigned 0, forming the initial probability of the lncRNA network, $v_0$. If protein $p_j$ interacts with lncRNA $l_i$, then $p_j$ is a seed node in the protein network. The initial probability vector of the protein network, $u_0$, is formed by assigning equal probabilities to the protein seed nodes, under the condition that their sum is equal to 1. For the heterogeneous network, the initial probability is

$$p_0 = \begin{bmatrix} (1 - \eta) \, u_0 \\ \eta \, v_0 \end{bmatrix}.$$

We use the parameter $\eta \in (0, 1)$ to weight the importance of the lncRNA network and the protein network. If $\eta = 0.5$, the lncRNA-lncRNA similarity network and the PPI network are equally weighted. If $\eta < 0.5$, the random walk tends to return to the protein network.
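The construction of the initial probability can be sketched as follows. The sketch assumes that the vector stacks the protein block first and the lncRNA block second, and that the interaction matrix has proteins as rows and lncRNAs as columns; the function name `initial_probability`, the weight name `eta`, and the toy matrix are illustrative assumptions.

```python
def initial_probability(query, interactions, eta=0.5):
    """Stack the protein seed vector and the lncRNA seed vector.

    interactions[i][j] == 1 if protein i is known to interact with lncRNA j.
    Assumed layout: protein entries first, then lncRNA entries.
    """
    n_proteins = len(interactions)
    n_lncrnas = len(interactions[0])
    seeds = [i for i in range(n_proteins) if interactions[i][query] == 1]
    u0 = [0.0] * n_proteins
    for i in seeds:
        u0[i] = 1.0 / len(seeds)  # equal probability for every seed protein
    v0 = [0.0] * n_lncrnas
    v0[query] = 1.0               # the query lncRNA is the only lncRNA seed
    return [(1 - eta) * u for u in u0] + [eta * v for v in v0]


# Toy data: 3 proteins x 2 lncRNAs; proteins 0 and 1 bind lncRNA 0.
toy_interactions = [
    [1, 0],
    [1, 1],
    [0, 0],
]
p0 = initial_probability(0, toy_interactions, eta=0.5)
```

With `eta = 0.5`, half of the starting mass sits on the query lncRNA and the other half is split evenly between its two known protein partners.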
In order to implement the random walk on the heterogeneous network, the transition matrix must be defined. We define

$$W = \begin{bmatrix} W_P & W_{PL} \\ W_{LP} & W_L \end{bmatrix}$$

as the transition matrix, where $W_P$ and $W_L$ are the subnetwork transition matrices giving the probability of the random walker transiting from one protein (lncRNA) to another protein (lncRNA) during the random walk. $W_{PL}$ indicates the probability of the random walker transiting from the protein network to the lncRNA network, and $W_{LP}$ indicates the movement from the lncRNA network to the protein network. We define $\lambda$ as the probability of the random walker jumping from the protein network to the lncRNA network and vice versa. $W$ is defined as follows.
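The probability of the random walker transiting from protein $p_i$ to $p_j$ is defined as

$$W_P(i, j) = \begin{cases} \dfrac{SP'(i, j)}{\sum_k SP'(i, k)}, & \text{if } \sum_j A(i, j) = 0, \\[2ex] (1 - \lambda) \dfrac{SP'(i, j)}{\sum_k SP'(i, k)}, & \text{otherwise.} \end{cases}$$

$\sum_j A(i, j) = 0$ means that $p_i$ only connects to proteins, so in the next step the walker can only transit randomly to a direct neighbor of $p_i$ in the PPI network. Otherwise, the walker can jump from $p_i$ to the lncRNA-lncRNA network with probability $\lambda$; under that condition, the probability of transiting to $p_j$ is multiplied by $1 - \lambda$.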
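Analogously, the probability from lncRNA $l_i$ to $l_j$ can be defined as

$$W_L(i, j) = \begin{cases} \dfrac{SL(i, j)}{\sum_k SL(i, k)}, & \text{if } \sum_j A(j, i) = 0, \\[2ex] (1 - \lambda) \dfrac{SL(i, j)}{\sum_k SL(i, k)}, & \text{otherwise.} \end{cases}$$

The probability from protein $p_i$ to lncRNA $l_j$ is defined as

$$W_{PL}(i, j) = \begin{cases} \lambda \dfrac{A(i, j)}{\sum_k A(i, k)}, & \text{if } \sum_j A(i, j) \neq 0, \\[2ex] 0, & \text{otherwise.} \end{cases}$$

$\sum_j A(i, j) \neq 0$ means that $p_i$ connects to at least one lncRNA, and the walker can jump from $p_i$ to the lncRNA-lncRNA network with probability $\lambda$; under that condition, we can further calculate the probability of transiting to $l_j$. Otherwise, the probability of transiting to $l_j$ is 0.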
The probability from lncRNA $l_i$ to protein $p_j$ can be defined in a similar manner as

$$W_{LP}(i, j) = \begin{cases} \lambda \dfrac{A(j, i)}{\sum_k A(k, i)}, & \text{if } \sum_j A(j, i) \neq 0, \\[2ex] 0, & \text{otherwise.} \end{cases}$$

Once the initial probability $p_0$ and the transition matrix $W$ are defined, the random walk with restart can be implemented on the heterogeneous network. The iteration stops when the change between $p_t$ and $p_{t+1}$ is less than $10^{-10}$, indicating that a stable probability $p_\infty$ has been reached.

Leave-One-Out Cross Validation Test.
We implement a LOOCV procedure to test the performance of LPIHN. In each cross validation trial, one known lncRNA-protein interaction is held out as test data and the rest are taken as the training dataset. The method is then evaluated by whether it successfully reconstructs the hidden interaction.
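The random walk and a single LOOCV trial can be sketched together as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: nodes are ordered with proteins first, the transition matrix is row-stochastic and applied as $W^{T} p$, convergence is measured with the L1 norm, and candidate proteins are those that are not training seeds. All function names, parameter names (`lam`, `eta`, `gamma`), and toy values are hypothetical.

```python
def build_transition(sp, sl, a, lam=0.5):
    """Transition matrix over proteins (first) and lncRNAs (second).

    sp: m x m protein scores, sl: n x n lncRNA similarities,
    a:  m x n 0/1 matrix with a[i][j] == 1 if protein i binds lncRNA j.
    """
    m, n = len(sp), len(sl)
    w = [[0.0] * (m + n) for _ in range(m + n)]
    for i in range(m):  # rows for walks that start at a protein
        cross, intra = sum(a[i]), sum(sp[i])
        stay = (1 - lam) if cross else 1.0  # no lncRNA neighbour: never jump
        for j in range(m):
            if intra:
                w[i][j] = stay * sp[i][j] / intra
        for j in range(n):
            if cross:
                w[i][m + j] = lam * a[i][j] / cross
    for i in range(n):  # rows for walks that start at an lncRNA
        cross, intra = sum(a[k][i] for k in range(m)), sum(sl[i])
        stay = (1 - lam) if cross else 1.0
        for j in range(n):
            if intra:
                w[m + i][m + j] = stay * sl[i][j] / intra
        for j in range(m):
            if cross:
                w[m + i][j] = lam * a[j][i] / cross
    return w


def random_walk_with_restart(w, p0, gamma=0.3, tol=1e-10, max_iter=10000):
    """Iterate p <- (1 - gamma) * W^T p + gamma * p0 until the L1 change < tol."""
    size = len(p0)
    p = list(p0)
    for _ in range(max_iter):
        nxt = [(1 - gamma) * sum(w[i][j] * p[i] for i in range(size))
               + gamma * p0[j] for j in range(size)]
        change = sum(abs(x - y) for x, y in zip(nxt, p))
        p = nxt
        if change < tol:
            break
    return p


def loocv_rank(sp, sl, a, protein, lncrna, lam=0.5, eta=0.5, gamma=0.3):
    """Hide one known interaction and return the rank of the hidden protein."""
    m, n = len(sp), len(sl)
    train = [row[:] for row in a]
    train[protein][lncrna] = 0  # hide the test interaction
    seeds = [i for i in range(m) if train[i][lncrna]]
    p0 = [0.0] * (m + n)
    for i in seeds:
        p0[i] = (1 - eta) / len(seeds)
    p0[m + lncrna] = eta
    w = build_transition(sp, sl, train, lam)
    scores = random_walk_with_restart(w, p0, gamma)
    candidates = [i for i in range(m) if i not in seeds]
    ranked = sorted(candidates, key=lambda i: scores[i], reverse=True)
    return ranked.index(protein) + 1


# Toy network: 3 proteins, 2 lncRNAs; all values are made up.
toy_sp = [[0.0, 0.9, 0.1], [0.9, 0.0, 0.1], [0.1, 0.1, 0.0]]
toy_sl = [[0.0, 0.8], [0.8, 0.0]]
toy_a = [[1, 1], [1, 0], [0, 0]]
toy_w = build_transition(toy_sp, toy_sl, toy_a)
toy_rank = loocv_rank(toy_sp, toy_sl, toy_a, protein=1, lncrna=0)
```

In the toy network, the hidden partner (protein 1) is strongly linked to the remaining seed protein, so it is expected to rank ahead of the weakly connected protein 2. A node that has cross-network links but no within-network neighbors would leave its row summing to less than 1; the sketch does not treat that case specially.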
ROC curves are used to evaluate the performance of the method; for a rank threshold $k$, sensitivity (Sn) and specificity (Sp) are defined as follows:

$$\mathrm{Sn} = \frac{TP}{TP + FN}, \qquad \mathrm{Sp} = \frac{TN}{TN + FP}.$$

TN and TP represent the numbers of negative and positive samples that are correctly predicted. FN and FP represent the numbers of positive and negative samples that are wrongly predicted. We plot Sn versus 1 − Sp at different thresholds separating the predictions [33], which gives the ROC curve, and calculate the AUC, the area under the ROC curve. Meanwhile, some commonly used measurements, namely, accuracy (Acc), precision (Pre), and Matthews correlation coefficient (MCC), are calculated as follows:

$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Pre} = \frac{TP}{TP + FP},$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}.$$

We also use precision versus recall and fold enrichment to measure the performance. For lncRNA $l_i$, the top $k$ ranked proteins are considered to interact with $l_i$ in our method. Precision is the fraction of proteins ranked within the top $k$ that are true lncRNA-protein interactions in the cross validation procedure. Recall is the fraction of hidden interactions that are reconstructed within the top $k$. Another measure used in this paper is fold enrichment. For a query lncRNA, let $N$ be the number of its candidate proteins and $r$ the rank of the test protein in the candidate protein set; the fold enrichment is then calculated as fold enrichment $= (N/2)/r$. Here we use the average fold enrichment over all test data for assessment.
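The evaluation measures above can be computed with a short sketch such as the following. It uses the standard confusion-matrix definitions and reads the fold enrichment formula as (number of candidates / 2) / rank; the function names, the zero-denominator convention, and the example counts are illustrative assumptions.

```python
import math


def _ratio(num, den):
    """Safe division; assumption: an undefined ratio is reported as 0.0."""
    return num / den if den else 0.0


def confusion_metrics(tp, fp, tn, fn):
    """Return Sn, Sp, Acc, Pre and MCC from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Sn": _ratio(tp, tp + fn),
        "Sp": _ratio(tn, tn + fp),
        "Acc": _ratio(tp + tn, tp + fp + tn + fn),
        "Pre": _ratio(tp, tp + fp),
        "MCC": _ratio(tp * tn - fp * fn, denom),
    }


def fold_enrichment(n_candidates, rank):
    """Fold enrichment of a held-out protein ranked `rank` among candidates."""
    return (n_candidates / 2) / rank


# Made-up counts for one query lncRNA over 96 candidate proteins.
example = confusion_metrics(tp=9, fp=1, tn=83, fn=3)
best_case = fold_enrichment(96, 1)
```

A held-out protein ranked first among 96 candidates gives the maximum fold enrichment of 48, while a rank at the midpoint of the list gives 1, the value expected for a random ordering.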

Comparison with Other Network-Based Methods.
We compare the performance of LPIHN with two other network-based methods: PRINCE [21] and RWR [22]. In the RWR method, at least two proteins are required for one lncRNA to perform LOOCV. Therefore, we only consider lncRNAs that interact with at least two proteins. After this preprocessing, we obtain 1,113 lncRNAs and 96 proteins, and 4,870 lncRNA-protein interactions are regarded as the gold-standard dataset used in cross validation. Then, LOOCV is implemented to evaluate the performance of these methods. According to previous research [34], we set $\lambda = 0.5$ and $\eta = 0.5$ here and fix $\gamma$ to 0.3, as it has been reported that the restart probability has only a very slight effect on the result [22,35]. The ROC curves of LPIHN, PRINCE, and RWR are plotted in Figure 2, which clearly shows that the ROC curve of LPIHN is consistently above those of the other two methods. From Table 1, we can see that LPIHN achieves an AUC of 96.0%, which is higher than those of PRINCE and RWR, which achieve AUCs of 90.6% and 88.1%, respectively. This indicates that the performance of LPIHN is better than that of PRINCE and RWR. To verify that the predictions obtained by our method are not generated by chance, we perform the LOOCV test on random lncRNA-protein interaction networks. The lncRNA-protein interaction network is randomized 1000 times, which means that we select seed proteins randomly for each lncRNA. The AUC value of the randomization process is 53.0%, which is much lower than the AUCs of the three methods. This indicates that our method can discover potential lncRNA-protein interactions.
Besides the AUC value, we also compare the Sn and Sp of these methods (Figure S1). The curves of precision and recall of LPIHN, PRINCE, and RWR with the varying threshold $1 \le k \le 96$ are shown in Figure 4(a), which shows that LPIHN achieves the highest precision of 72.1%, while the PRINCE and RWR methods achieve lower values of 20.6% and 17.3%, respectively. Meanwhile, compared with PRINCE and RWR, the LPIHN method achieves a higher precision at every recall value. Moreover, the comparison of these methods in terms of average fold enrichment is shown in Figure 4(b). For all of the 96 proteins, LPIHN achieves an average enrichment score of 18.9, which is 10.6% and 10.1% compared to PRINCE and RWR, respectively.
To further evaluate the performance of LPIHN, we implement case studies for two lncRNAs, NONHSAT010657 (HNRNPU-AS1) and NONHSAT022127 (MALAT1), which are involved in 12 and 24 lncRNA-protein interactions, respectively. The comparison between LPIHN, PRINCE, and RWR in terms of Sn, Acc, Pre, and MCC is shown in Figure 5, which indicates that LPIHN achieves better performance than PRINCE and RWR. In particular, when Sp is 99.0%, for lncRNA HNRNPU-AS1, the Sn, Acc, Pre, and MCC values of LPIHN are increased by 16.7%, 2.1%, 25.0%, and 22.9% when compared with PRINCE and by 7.3%, 1.1%, 8.3%, and 10% when compared with RWR, respectively. For lncRNA MALAT1, the Sn, Acc, Pre, and MCC values of LPIHN are increased by 20.9%, 5.3%, 1.7%, and 25% when compared with PRINCE and by 45.9%, 11.5%, 1.7%, and 35.4% when compared with RWR, respectively. Moreover, we reconstruct the interaction network of lncRNA HNRNPU-AS1 by using the prediction data of these three methods (Figure 6). Among the 12 true lncRNA-protein interactions of lncRNA HNRNPU-AS1, LPIHN successfully reconstructs 9 interactions, while PRINCE and RWR retrieve fewer interactions, 7 and 6, respectively.
To examine the effect of the number of interactions on the performance of the proposed method, we group the lncRNAs into four equal intervals according to their number of interactions. The AUC values of the different intervals are plotted in Figure S2. The result shows that the more proteins interact with a query lncRNA, the better the performance the proposed method can achieve.

Comparison with Existing Methods.
We also compare the performance of LPIHN on lncRNAs HNRNPU-AS1 and MALAT1 with that of existing methods: lncPro and RPIseq. RPIseq yields two types of scores, based on support vector machine (SVM) and random forest (RF), respectively. ROC curves and AUC values of these methods are shown in Figure 7. The ROC curve of LPIHN is consistently above those of the other methods on both HNRNPU-AS1 and MALAT1. For lncRNA HNRNPU-AS1, the AUC value of LPIHN is 34.8%, 59.9%, and 39.4% higher than those of lncPro, RPIseq-RF, and RPIseq-SVM, respectively. On MALAT1, the AUC value of LPIHN is 30.6%, 41.9%, and 35.2% higher than those of lncPro, RPIseq-RF, and RPIseq-SVM, respectively. All the evaluations above show that LPIHN outperforms the other two network-based methods and the existing methods, which indicates that LPIHN is a powerful method to predict the interactions between lncRNAs and proteins.

Case Studies.
The proposed method is able to predict novel lncRNA-protein interactions for a query lncRNA. For each lncRNA, the proteins ranked within the top 10 (a user-defined threshold) are considered as the potential proteins interacting with the query lncRNA. To further evaluate the efficiency of LPIHN in predicting novel lncRNA-protein interactions, we present case studies of five lncRNAs: NONHSAT137627 (FTX), HNRNPU-AS1, MALAT1, NONHSAT004412 (RP4-665J23.1), and NONHSAT016118 (RP11-18I14.10). Figure 8 shows the predicted network for these lncRNAs, where the known lncRNA-protein interactions and the top 5 ranked predictions are displayed. For lncRNAs FTX, HNRNPU-AS1, and MALAT1, the top 10 predicted proteins are listed in Table 2.

Conclusion
With the development of lncRNA research, computational methods have been published for the prediction of lncRNA-protein interactions. In this paper, we introduce a network-based method, LPIHN, to predict the proteins interacting with lncRNAs. First, a heterogeneous network is constructed by connecting the PPI network and the lncRNA-lncRNA similarity network using known lncRNA-protein interactions. Then, an iterative random walk with restart is implemented on the heterogeneous network to score proteins for each lncRNA. Finally, LOOCV is implemented to evaluate the performance of our method. Top-ranked lncRNA-protein interactions predicted by our method are supported by existing literature or databases. The good performance and the practical value show that our approach is a promising way to predict potential lncRNA-protein interactions.
While the results are promising, the LPIHN method has some limitations. Firstly, we test our method on only one database (i.e., NPInter 2.0). From the known lncRNA-protein interaction dataset, we observe that each lncRNA interacts with about 4.37 proteins on average. Due to the relative sparsity of the known lncRNA-protein interactions, the network-based method may produce biased predictions. This situation can be improved as more comprehensive lncRNA-protein interaction datasets become available. Secondly, the skewed degree distribution of the network may affect the result of our prediction; adding some appropriate resistance in the process of the random walk may improve the performance of our method. Thirdly, the proposed method can only calculate similarity between lncRNAs that have expression profiles, which means that as lncRNA-protein interaction datasets grow, the coverage of the lncRNA-lncRNA similarity network may become incomplete. This situation can be improved by adding other information, such as known lncRNA-protein interactions, to the similarity calculation.