SEBGLMA: Semantic Embedded Bipartite Graph Network for Predicting lncRNA-miRNA Associations



Introduction
As regulatory molecules, noncoding RNAs participate in many biological activities, such as epigenetic control, gene transcription, translation, chromosome organization, cell proliferation, and developmental programs [1]. In the biological regulatory network, each RNA plays a unique role. Long noncoding RNAs (lncRNAs), with more than 200 nucleotides, take part in a variety of biological activities including cell differentiation, cell growth, disease treatment, and gene transcription [2]. The diversity and additional functions of lncRNAs have attracted extensive attention. MicroRNA (miRNA), with 19-22 nucleotides, is the most common target RNA of lncRNA [3,4]. The miRNAs transcribed by viruses change the number of proteins and immune resistance by affecting the transcription efficiency of host messenger RNA (mRNA). Recently, increasing evidence has emphasized the role of long noncoding RNAs (lncRNAs) as epigenetic factors in disease occurrence and development. Moreover, researchers are increasingly focused on their relationships with downstream target miRNAs [5].
LncRNA has a structure and transcription process similar to mRNA. Previous studies have indicated that lncRNA and miRNA can bind to each other to control gene expression. LncRNA can regulate miRNA functions by serving as competing endogenous RNA (ceRNA) that sponges miRNA and alters its expression level [6,7]. Meanwhile, miRNA regulates the expression stability of lncRNA in an Argonaute2-dependent manner by reducing the host lncRNA of the virus. The particular segment composition of lncRNA provides a reliable platform for chemical binding. By competitively binding miRNA in cells, lncRNA limits the ability of miRNA to interfere with the protein-encoding process of mRNA.
Up to now, only a small part of the working mechanisms of lncRNA-miRNA associations (LMAs) has been studied. There are still a large number of lncRNA-miRNA pairs that need to be explored. However, existing research has proved that tumors can be inhibited by regulating the corresponding LMA interactions. In addition to the ceRNA mechanism described above, researchers have also found that LMAs play a crucial role in regulating the pathogenesis of cancers, including epithelial-mesenchymal transition (EMT), cancer stem cells (CSCs), drug resistance, and other molecular mechanisms. For instance, recent studies have uncovered parts of the lncRNA-miRNA association networks hidden in breast, bladder, and colorectal cancer [8][9][10]. Therefore, identifying the associations between lncRNAs and miRNAs will benefit the regulation of different RNA quantities for the treatment of diseases and provide chemical indicators for diagnosis and prognosis. Feasible prediction models are urgently needed to find potential LMAs and to further understand gene regulatory networks at the molecular level. With the advancement of gene sequencing and similarity calculation, many methods have been developed by researchers to detect possible LMAs.
The wet experiment is the traditional method for determining LMAs. Zhang et al. [11] found that the downregulation of miR-7 in BCSCs might be indirectly attributed to lncRNA HOTAIR by performing ChIP-PCR and dual-luciferase reporter assays. Kallen et al. [12] revealed that H19 is an important regulator of the major let-7 family of microRNAs by crosslinking and real-time PCR. In recent years, the increasing amount of clinical experimental data has provided an available platform for computer-aided prediction systems. Wang et al. [13] developed the KATZ model to construct a heterogeneous network; the network only utilized topology information to generate the association scores of lncRNA-miRNA pairs. Yang et al. [14] proposed the multiple-index latent factor model (MILFM), which projects the predictive information onto a few local subspaces to obtain key characteristics for further fitting by regularization. Zhang et al. [15] proposed the sequence-derived linear neighborhood propagation method (SLNPM) to find potential interacting lncRNA-miRNA pairs; specifically, similarity information combination and interaction profile information combination are applied when evaluating the weighted averages. Huang et al. [16] developed a novel group preference Bayesian collaborative filtering (GBCF) model for deriving a top-k probability ranking list for an individual miRNA or lncRNA. Xu et al. [17] proposed a structural perturbation method for predicting lncRNA-miRNA interactions (SPMLMI), which utilized the Pearson correlation coefficient to measure lncRNA and miRNA similarity and formed a two-layer relationship network. Recently, Liu et al. [18] developed the neighborhood-regularized logistic matrix factorization (LMFNRLMI) model, which utilizes the strongest adjacent relations in the neighborhood and establishes an adjacency matrix through the K-nearest-neighbor approach to infer LMAs.
In this paper, we design a novel model named semantic embedded bipartite graph network for predicting lncRNA-miRNA associations (SEBGLMA). This method combines K-mer segmentation, word2vec, the Gaussian interaction profile (GIP) kernel, a graph convolution network (GCN), and a rotation forest (RoF). The structure of the model can be separated into the following steps. (1) The RNA sequences are first segmented into subsequences by K-mer segmentation to generate corpora for word2vec, recording the different nucleotide combinations. (2) The corpora are fed into word2vec to construct the Lnc2Vec and Mi2Vec models for extracting semantic attribute features. (3) The known lncRNA-miRNA associations are fed into the GIP model to obtain self-similarities between samples of the same class, and these self-similarities are used to complete the adjacency matrix. (4) The adjacency matrix and semantic features are input into the GCN to form a bipartite graph model; the outputs of the GCN are regarded as fusion features of the nodes, combining attribute and behavior information. (5) The features are sent into the RoF to predict possible LMAs. Moreover, case studies of hsa-miR-497-5p and NONHSAT022145.2 are also conducted. This paper makes the following contributions: (1) the RNA functional similarity within the same class is used to enrich the edges of the graph; (2) the attribute and behavior characteristics of lncRNAs and miRNAs are effectively fused to improve sample diversity; (3) our model is effective at screening candidates for subsequent clinical trials. The workflow of SEBGLMA is displayed in Figure 1.
The rest of this study is arranged as follows: in Section 2, we introduce the benchmark dataset for LMA prediction and the different parts of the proposed model. In Section 3, we discuss the evaluation criteria, the performance of the proposed model, and comparisons. The conclusion and future work are presented in Section 4.

Datasets.
In this paper, the benchmark dataset is established based on the clinical database lncRNASNP2 (https://bioinfo.life.hust.edu.cn/lncRNASNP#!/). LncRNASNP2 offers numerous interactions between lncRNAs and miRNAs that have been clinically validated; the specific subjects of this database are human and mouse [19]. It also provides information on single-nucleotide polymorphisms (SNPs), mutations, and diseases related to lncRNA. This study extracted interaction data covering 780 lncRNAs and 275 miRNAs from the database and collected 10597 interacting pairs. Furthermore, the corresponding nucleotide sequences of lncRNAs were obtained from the LNCipedia database (https://lncipedia.org/) [20], and the sequences of miRNAs were obtained from the miRBase database (https://www.mirbase.org/index.shtml) [21]. Finally, after eliminating redundant information, 4966 interacting pairs remain in the benchmark dataset, involving 770 lncRNAs and 275 miRNAs. Table 1 gives the statistics of the benchmark dataset.
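As an illustration of how the validated associations can be arranged for the later steps, the following minimal sketch builds the binary lncRNA-miRNA interaction matrix from a list of pairs. The file name and column names are hypothetical placeholders for the lncRNASNP2 export described above.

```python
import numpy as np
import pandas as pd

# Hypothetical file of validated pairs, one lncRNA-miRNA pair per row (columns assumed).
pairs = pd.read_csv("lncRNASNP2_pairs.csv")           # columns: lncRNA, miRNA

lnc_ids = sorted(pairs["lncRNA"].unique())            # 770 lncRNAs in the benchmark set
mi_ids = sorted(pairs["miRNA"].unique())              # 275 miRNAs in the benchmark set
lnc_index = {name: i for i, name in enumerate(lnc_ids)}
mi_index = {name: j for j, name in enumerate(mi_ids)}

# Binary interaction matrix LM: rows are lncRNA profiles, columns are miRNA profiles.
LM = np.zeros((len(lnc_ids), len(mi_ids)), dtype=np.float32)
for _, row in pairs.iterrows():
    LM[lnc_index[row["lncRNA"]], mi_index[row["miRNA"]]] = 1.0

print(LM.shape, int(LM.sum()))                        # e.g. (770, 275) and 4966 known pairs
```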

GIP Kernel.
Compared with similarities obtained by calculating numerical or spatial distances between sequences, the similarities between samples of the same category calculated from functional differences can explicitly describe the differences between samples. Genes with higher similarity usually share a similar interoperating mechanism; to be specific, lncRNAs that bind the same target miRNAs often show higher similarity.
Hence, the Gaussian interaction profile (GIP) kernel is utilized to separately describe the lncRNA and miRNA self-similarities and thereby complete the adjacency matrix that describes the relationships between the nodes of the graph network [22][23][24]. The lncRNA-miRNA associations offer a topological structure for the GIP kernel to generate two types of functional self-similarities, denoted as $L_{GIP}$ and $M_{GIP}$. The vectors $V(l_i)$ and $V(m_i)$ represent the interaction profiles of lncRNA and miRNA, respectively. Taking miRNA as an example, $V(m_i)$ is constructed as a binary vector by checking whether the miRNA interacts with each lncRNA in turn; the elements corresponding to known interactions are marked as 1, and the rest as 0 [25]. Thus, the self-similarities $L_{GIP}$ and $M_{GIP}$ are generated by the following equations:

$$L_{GIP}\left(l_i, l_j\right) = \exp\left(-\lambda_l \left\| V(l_i) - V(l_j) \right\|^2\right),$$
$$M_{GIP}\left(m_i, m_j\right) = \exp\left(-\lambda_m \left\| V(m_i) - V(m_j) \right\|^2\right),$$

where $l_i$ and $l_j$ represent the $i$th and $j$th lncRNA, and $m_i$ and $m_j$ represent the $i$th and $j$th miRNA. $\lambda$ is the normalized kernel bandwidth parameter of the GIP kernel similarity, and $\lambda'$ is the original bandwidth parameter. The definitions of $\lambda_l$ and $\lambda_m$ are as follows:

$$\lambda_l = \lambda'_l \bigg/ \left( \frac{1}{n_l} \sum_{i=1}^{n_l} \left\| V(l_i) \right\|^2 \right), \qquad \lambda_m = \lambda'_m \bigg/ \left( \frac{1}{n_m} \sum_{i=1}^{n_m} \left\| V(m_i) \right\|^2 \right),$$

where $n_l$ and $n_m$ are the numbers of lncRNAs and miRNAs. Finally, the self-similarities $L_{GIP}$ and $M_{GIP}$ compose the blocks $LL$ and $MM$. The completed adjacency matrix is

$$A = \begin{bmatrix} LL & LM \\ ML & MM \end{bmatrix},$$

where $A$ is a $1045 \times 1045$ matrix containing the topological relationships of the 770 lncRNA and 275 miRNA nodes. Blocks $LL$ and $MM$ represent the self-similarity matrices of lncRNAs and miRNAs, respectively, block $LM$ is the known lncRNA-miRNA association matrix, and block $ML$ is the transpose of $LM$.
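A minimal sketch of the GIP self-similarity computation, assuming the binary interaction matrix LM from the earlier dataset sketch and an original bandwidth of 1:

```python
import numpy as np

def gip_kernel(profiles: np.ndarray, bandwidth: float = 1.0) -> np.ndarray:
    """GIP self-similarity between the rows of a binary interaction-profile matrix."""
    # Normalized bandwidth: lambda = lambda' / mean squared norm of the profiles.
    lam = bandwidth / np.mean(np.sum(profiles ** 2, axis=1))
    # Pairwise squared Euclidean distances between interaction profiles.
    sq_norms = np.sum(profiles ** 2, axis=1)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    return np.exp(-lam * np.clip(sq_dist, 0.0, None))

# LL and MM blocks from the lncRNA rows and the miRNA columns of LM.
LL = gip_kernel(LM)        # 770 x 770 lncRNA self-similarity
MM = gip_kernel(LM.T)      # 275 x 275 miRNA self-similarity

# Completed adjacency matrix A = [[LL, LM], [ML, MM]] with ML = LM^T.
A = np.block([[LL, LM], [LM.T, MM]])   # 1045 x 1045
```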

K-Mer Segmentation.
Before numerically describing lncRNA and miRNA, the nucleotide sequences have to be segmented to build corpora. For the given sequences, the K-mer method is employed to extract the initial semantic features [26,27]. Specifically, each sequence is divided into subsequences by a sliding window of length K; a sequence containing N nucleotides generates N − K + 1 subsequences. In this section, the parameter K is set to 4. The number of possible lncRNA or miRNA subsequences is therefore 4^K, since lncRNA sequences are composed of adenine (A), cytosine (C), guanine (G), and thymine (T), while miRNA sequences are composed of adenine (A), cytosine (C), guanine (G), and uracil (U) [28]. An example of NONHSAT129051.2 being converted into subsequences is shown in Figure 2.
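A minimal sketch of the K-mer segmentation with K = 4 as stated above; the example sequence is a short illustrative fragment, not a real transcript:

```python
def kmer_segments(sequence: str, k: int = 4) -> list[str]:
    """Slide a window of length k over the sequence, producing N - k + 1 subsequences."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# Illustrative fragment only.
example = "AUGGCUAGC"
print(kmer_segments(example))   # ['AUGG', 'UGGC', 'GGCU', 'GCUA', 'CUAG', 'UAGC']
```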

Distributed Representation of lncRNA and miRNA Sequences.
The nucleotide subsequences of lncRNAs and miRNAs are utilized to construct the RNA2Vec models, including the Lnc2Vec and Mi2Vec models, for word embedding. These models characterize lncRNA and miRNA differently based on biological evidence. Then, the sequences are transformed into numerical vectors as attribute features. Considering the size of the benchmark dataset, the word2vec model with skip-gram is applied to realize the word representation [29,30]. In general, this model constructs a projection neural network that learns the distribution of words through sliding windows. The framework of the skip-gram model is displayed in Figure 3.
With regard to a sentence $(w_1, w_2, \cdots, w_{N-K+1})$, where $w$ represents a word (nucleotide subsequence), the objective function of the model is defined as follows:

$$J = \frac{1}{N-K+1} \sum_{n=1}^{N-K+1} \sum_{-c \le m \le c,\, m \neq 0} \log P\left(w_{n+m} \mid w_n\right),$$

where $c$ is the maximum distance between a word and the central word in the sliding window. The conditional probability $P(w_{n+m} \mid w_n)$ is calculated by the following softmax equation [31]:

$$P\left(w_{n+m} \mid w_n\right) = \frac{\exp\left({v'_{w_{n+m}}}^{\top} v_{w_n}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_n}\right)},$$

where $v_w$ and $v'_w$ represent the input and output vector representations of the word $w$, respectively, and $W$ stands for the width of the established lexicon. Similar to nondigital text processing, this section inputs the sequences and corpora into the network to obtain numerical vectors as the semantic features of genes. After optimization, the parameters size, window, iter, and batch_words are set to 500, 5, 10, and 10, respectively. The size determines the length of the output vector; the window gives the maximum distance between a central word and its contextual words; iter is the number of model iterations; and batch_words is the number of words passed to worker operations. The other parameters are set to their default values. Figure 4 gives an example of the semantic embedding process.
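A minimal sketch of training such an embedding with gensim's skip-gram word2vec on the K-mer corpora; lnc_sequences is an assumed list of lncRNA sequences, and the gensim 4.x parameter names vector_size and epochs stand in for the older size and iter quoted above:

```python
import numpy as np
from gensim.models import Word2Vec

# Each corpus entry is the list of 4-mers produced from one RNA sequence.
lnc_corpus = [kmer_segments(seq, k=4) for seq in lnc_sequences]   # lnc_sequences assumed

lnc2vec = Word2Vec(
    sentences=lnc_corpus,
    vector_size=500,   # "size": length of the output vector
    window=5,          # maximum distance between the central word and context words
    sg=1,              # skip-gram architecture
    epochs=10,         # "iter": number of training iterations
    batch_words=10,
    min_count=1,
)

# One plausible pooling choice: the attribute feature of a sequence is the mean of its k-mer vectors.
def sequence_vector(model: Word2Vec, kmers: list[str]) -> np.ndarray:
    return np.mean([model.wv[k] for k in kmers if k in model.wv], axis=0)
```

The Mi2Vec model can be built the same way from the miRNA corpus.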

Graph Convolution Network.
The graph convolution network (GCN) [32] is widely utilized to characterize the associations between a central node and its neighbor nodes. This model scales linearly in the number of edges and learns hidden-layer representations that encode both the local graph structure and the node features [33]. Therefore, the GCN is employed to calculate the behavior features. The structure of the GCN is shown in Figure 5.

In this research, the associations between lncRNAs and miRNAs are embedded in the graph $G = (V, E)$, where $V$ and $E$ denote the nodes and edges, respectively. Meanwhile, the attribute features and the adjacency matrix are fed into the GCN [34,35]; hence, the GCN can also integrate the attribute features with the behavior features. Specifically, the Laplacian regularization term is applied to define the convolution kernel for differentially describing nodes. The normalized graph Laplacian is expressed as follows:

$$L = I_N - D^{-1/2} A D^{-1/2},$$

where $I_N$ is the identity matrix, $D$ is the degree matrix, and $A$ is the adjacency matrix. The application of the Laplacian regularization term makes the convolution process smoother. The spectral decomposition of $L$ and the corresponding graph convolution are defined as follows:

$$L = U \Lambda U^{\top}, \qquad g_{\theta} \star x = U g_{\theta}(\Lambda) U^{\top} x,$$

where $x$ denotes the input features of the nodes and $U^{\top} x$ transfers the nodes into the Fourier domain. Subsequently, Chebyshev polynomials are employed to approximate the convolution kernel and reduce the computational complexity; the new convolution operation is

$$g_{\theta'} \star x \approx \sum_{k=0}^{K} \theta'_{k}\, T_{k}\!\left(\tilde{L}\right) x, \qquad \tilde{L} = \frac{2}{\lambda_{\max}} L - I_N,$$

where $\lambda_{\max}$ is the maximum eigenvalue of $L$. In this paper, we only consider the first-order neighbors of the central node, so the parameters $K$ and $\lambda_{\max}$ are set to 1 and 2. The Chebyshev recursion is defined as

$$T_{k}(x) = 2x\, T_{k-1}(x) - T_{k-2}(x), \qquad T_{0}(x) = 1, \quad T_{1}(x) = x.$$

The equation of graph convolution is updated to the following equation:

$$g_{\theta} \star x \approx \theta \left( I_N + D^{-1/2} A D^{-1/2} \right) x.$$

To prevent gradient explosion, the equation is further standardized with the renormalization trick:

$$Z = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} X \Theta, \qquad \tilde{A} = A + I_N, \quad \tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}.$$

Here, $X$ is an $N \times C$ matrix containing the $C$-dimensional semantic features of the $N$ nodes, and $\Theta$ is a trainable weight matrix. According to this convolution kernel, the forward propagation of a two-layer model is

$$F = \mathrm{softmax}\!\left( \hat{A}\, \mathrm{ReLU}\!\left( \hat{A} X W^{(0)} \right) W^{(1)} \right), \qquad \hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2},$$

where $W^{(0)}$ and $W^{(1)}$ represent the weight matrices of hidden layers 1 and 2. Finally, the matrix $F$ is utilized as the output feature integrating attribute and behavior characteristics.
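A minimal numpy sketch of this two-layer forward pass, assuming the completed adjacency matrix A and a node feature matrix X stacked from the Lnc2Vec/Mi2Vec vectors; the weights are randomly initialized for illustration rather than trained, and the hidden sizes are arbitrary:

```python
import numpy as np

def normalize_adjacency(A: np.ndarray) -> np.ndarray:
    """Renormalization trick: A_hat = D~^{-1/2} (A + I) D~^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_forward(A: np.ndarray, X: np.ndarray, W0: np.ndarray, W1: np.ndarray) -> np.ndarray:
    """F = A_hat ReLU(A_hat X W0) W1; the softmax is omitted because the rows serve as fusion features."""
    A_hat = normalize_adjacency(A)
    H = np.maximum(A_hat @ X @ W0, 0.0)    # first layer with ReLU
    return A_hat @ H @ W1                  # second layer

rng = np.random.default_rng(0)
N, C, H1, H2 = A.shape[0], X.shape[1], 256, 128     # hidden sizes are illustrative
W0 = rng.normal(scale=0.01, size=(C, H1))
W1 = rng.normal(scale=0.01, size=(H1, H2))
F_nodes = gcn_forward(A, X, W0, W1)        # N x H2 fusion features, later paired for the classifier
```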
Rotation Forest.
Rodriguez et al. [36] extended the ensemble forest into the rotation forest (RoF), which promotes diversity among the base classifiers by adding a rotation module. It provides a feasible pipeline for dealing with the benchmark dataset; therefore, this ensemble model is adopted to promote feature diversity and detect the interactions between lncRNAs and miRNAs [37]. First, the whole feature set is stochastically split into L independent subsets. Subsequently, the principal component analysis (PCA) algorithm transforms each subset before the ensemble is built. Finally, the transformed subsets are fed into the base classifiers. The matrix Q of size s × S represents the training set with S features for s samples. The corresponding labels $R = (r_1, r_2, \cdots, r_s)^{\top}$ are also sent into the model to supervise the training process [38]. The model has K base classifiers $H_i$.
The sequential training steps are as follows: (i) after optimizing the parameters, the feature set of the training data Z is divided into L disjoint subsets, each containing m = S/L features; (ii) let $Z_{i,j}$ be the jth feature subset for classifier $H_i$ and $Q_{i,j}$ the corresponding columns of Q; bootstrap sampling on 75% of $Q_{i,j}$ generates the training subset $Q'_{i,j}$; (iii) PCA is applied to $Q'_{i,j}$ to obtain the principal component coefficients $a_{i,j}^{(1)}, a_{i,j}^{(2)}, \cdots, a_{i,j}^{(m_j)}$; (iv) the rotation matrix $P_i$ composed of these coefficients is

$$P_i = \begin{bmatrix} a_{i,1}^{(1)}, \cdots, a_{i,1}^{(m_1)} & 0 & \cdots & 0 \\ 0 & a_{i,2}^{(1)}, \cdots, a_{i,2}^{(m_2)} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_{i,L}^{(1)}, \cdots, a_{i,L}^{(m_L)} \end{bmatrix}.$$

In classification, the possibility that a sample x belongs to category $r_j$ is $d_{i,j}(x P_i^{a})$, calculated by base classifier $H_i$, where $P_i^{a}$ is $P_i$ rearranged to match the original feature order. Furthermore, the confidence degree that x is assigned to each class is counted as

$$\mu_j(x) = \frac{1}{K} \sum_{i=1}^{K} d_{i,j}\left(x P_i^{a}\right).$$

The final category of sample x is given according to the largest confidence degree.
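A minimal sketch of this procedure using scikit-learn's PCA and decision trees; the 75% bootstrap fraction follows the text, the default K and L values correspond to the 35 trees and 26 feature subsets quoted later, and the wiring is a simplified illustration rather than the authors' exact implementation:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

def train_rotation_forest(Q, R, K=35, L=26, seed=0):
    """Train K trees, each on features rotated block-wise by PCA over L random feature subsets."""
    rng = np.random.default_rng(seed)
    S = Q.shape[1]
    ensemble = []
    for _ in range(K):
        perm = rng.permutation(S)                        # random split of the features into L subsets
        subsets = np.array_split(perm, L)
        rotation = np.zeros((S, S))
        for cols in subsets:
            idx = rng.choice(len(Q), size=int(0.75 * len(Q)), replace=True)   # 75% bootstrap sample
            pca = PCA().fit(Q[idx][:, cols])
            rotation[np.ix_(cols, cols)] = pca.components_.T                  # block of PCA coefficients
        tree = DecisionTreeClassifier().fit(Q @ rotation, R)
        ensemble.append((rotation, tree))
    return ensemble

def predict_rotation_forest(ensemble, X):
    """Average the per-tree class probabilities (the confidence degree) and take the argmax."""
    probs = np.mean([tree.predict_proba(X @ P) for P, tree in ensemble], axis=0)
    return probs.argmax(axis=1), probs
```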


Results and Discussion
where true positive (TP) represents the number of associated lncRNA-miRNA pairs that are correctly detected as positive samples; true negative (TN) records the number of nonassociated pairs that are correctly classified as negative samples; false positive (FP) denotes the number of nonassociated pairs that are incorrectly inferred as positive samples; and false negative (FN) is the count of associated pairs that are incorrectly predicted as negative samples. In addition, receiver operating characteristic (ROC) and precision-recall (PR) curves are depicted to visualize the experimental results [39,40]. The area under the ROC curve (AUC) and the area under the PR curve (AUPR) are also attached to the ROC and PR curves to justify our model and indicate the sample balance.
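The evaluation metrics referred to here are the standard ones; for completeness, the usual definitions in terms of TP, TN, FP, and FN (the exact formulas in the original are assumed to follow this convention) are

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Sensitivity} = \frac{TP}{TP + FN},$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP}, \qquad \mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}, \qquad \mathrm{F1} = \frac{2\,TP}{2\,TP + FP + FN}.$$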

Parameter Discussion.
In the experiments, the parameters K and L, which represent the numbers of feature subsets and decision trees, need to be optimized in the RoF classifier. To obtain the optimal results, a grid-search algorithm is applied to plot the accuracy surface of the prediction results under different parameter settings; in total, 2400 experiments with various combinations of K and L were conducted. As the L-value increases from 0 to 60, the surface indicates that the accuracy continuously improves; meanwhile, the accuracy first increases and then declines sharply as the parameter K rises from 0 to 40. Balancing accuracy and efficiency, the parameters K and L are set to 26 and 35, respectively. Figure 6 shows the prediction accuracy surface for different K-values and L-values.
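A sketch of such a grid search over the two RoF parameters, reusing the rotation forest sketch above; F and y denote the pairwise fusion features and association labels assumed to be prepared as in the pipeline, the grid shown is far smaller than the 2400 combinations in the paper, and the single hold-out split is only illustrative:

```python
from itertools import product
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

F_train, F_val, y_train, y_val = train_test_split(F, y, test_size=0.2, random_state=0)

best = (0.0, None)
for K_subsets, L_trees in product(range(2, 42, 4), range(5, 65, 5)):
    # Note the notation swap: here K is the number of feature subsets and L the number of trees.
    model = train_rotation_forest(F_train, y_train, K=L_trees, L=K_subsets)
    pred, _ = predict_rotation_forest(model, F_val)
    acc = accuracy_score(y_val, pred)
    if acc > best[0]:
        best = (acc, (K_subsets, L_trees))
print("best accuracy %.4f at (K, L) = %s" % best)
```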

Five-fold CV Results on Benchmark Data Sets.
To fairly validate the prediction ability of SEBGLMA and prevent overfitting and underfitting, we applied the 5-fold cross-validation (CV) method to the benchmark dataset with the same optimal parameters. Specifically, the benchmark dataset was segmented into five independent parts of the same size; the disjoint subsets take turns as the test set, while the other four serve as the training set. The statistics of the 5-fold CV results obtained by SEBGLMA on the benchmark dataset are listed in Table 2. The average accuracy, precision, sensitivity, specificity, Matthews correlation coefficient, and F1-score are 87.09%, 87.66%, 87.03%, 87.84%, 74.18%, and 86.99%, with corresponding standard deviations of 0.59%, 1.02%, 0.87%, 0.65%, 0.12%, and 0.68%. Figures 7 and 8 display the performance of our model with ROC and PR curves; the mean AUC of 0.9301 and AUPR of 0.9323 are attached to them.
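A minimal sketch of the 5-fold protocol with scikit-learn's KFold, again assuming the pairwise fusion features F and labels y prepared as above:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score, average_precision_score

aucs, auprs = [], []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(F):
    model = train_rotation_forest(F[train_idx], y[train_idx], K=35, L=26)
    _, probs = predict_rotation_forest(model, F[test_idx])
    aucs.append(roc_auc_score(y[test_idx], probs[:, 1]))
    auprs.append(average_precision_score(y[test_idx], probs[:, 1]))

print("mean AUC %.4f, mean AUPR %.4f" % (np.mean(aucs), np.mean(auprs)))
```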

Comparison with Other Classifiers.
At present, many supervised learning classifiers have been established to identify the associations between lncRNA and miRNA based on the 5-fold cross-validation method. To further evaluate the performance of the rotation forest (RoF) in SEBGLMA, we replaced the RoF with a support vector machine (SVM), deep learning with dual-net neural architecture (DLDP) [41], a light gradient boosting machine (LGBM), and a random forest (RF). In the comparison, the parameters of the RoF are set to the values optimized above. In the SVM classifier, a kernel function is applied to map the features into a high-dimensional space, and the classification process is simplified by the small-sample learning mechanism; after optimization, the parameters c and g are set to 0.7 and 38, respectively, with the radial basis function (RBF) kernel based on the LIBSVM tool. The DLDP classifier is a dual-net neural network combining feature importance ranking (FIR) and a multilayer perceptron (MLP). We set the parameter max_batches = 3000 and the feature-related parameters to 300; the rest of the parameters are set to their default values.
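A sketch of how such a classifier comparison can be run on the same features; scikit-learn's SVC stands in for the LIBSVM setup, the hyperparameters shown are only those quoted above, and the remaining candidates would be added the same way:

```python
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

candidates = {
    "SVM (RBF)": SVC(C=0.7, gamma=38, kernel="rbf"),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
# LightGBM, if installed, can be added in the same way:
# from lightgbm import LGBMClassifier
# candidates["LGBM"] = LGBMClassifier()

for name, clf in candidates.items():
    scores = cross_val_score(clf, F, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC {scores.mean():.4f}")
```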

Conclusion
In general, this paper provides a novel mechanism integrating K-mer segmentation, word2vec, the Gaussian interaction profile kernel, a graph convolution network, and a rotation forest to infer LMAs. Specifically, K-mer segmentation and word2vec are utilized to extract the attribute features of RNAs. Then, the adjacency matrix is completed by combining the GIP self-similarities and the known relationships between lncRNAs and miRNAs. Subsequently, the attribute features and the completed adjacency matrix are sent into the GCN to embed behavior characteristics. Finally, the fusion features are fed into the RoF to identify LMAs. The mean accuracy, precision, sensitivity, specificity, Matthews correlation coefficient, and F1-score of the proposed model were 87.09%, 87.66%, 87.03%, 87.84%, 74.18%, and 86.99%, respectively. To demonstrate the advancement of our model, we also conducted systematic comparisons. First, the classifier was replaced by SVM, DLDP, LGBM, and RF to validate the performance of the RoF. Second, ablation experiments were carried out to prove the contribution of each module. Finally, several state-of-the-art methods were employed to evaluate the prediction performance. The comparisons indicate that our model can be a robust and efficient tool for screening reliable candidates for clinical trials.

Limitations and Future Work
Besides improving the accurate prediction ability of the model, the limitations of the proposed model are also noted; these limitations, focusing on two aspects, are discussed in this section. On the one hand, the graph convolution network only considers the mutual relationships between the target nodes and their directly connected first-order neighbors to obtain local behavior information, so it can hardly capture the global structure information of a target node. In future work, graph neural networks covering different distances between target and neighbor nodes will be constructed simultaneously, and these networks will be cascaded to extract the global structure information of target nodes. On the other hand, noise and feature loss in the preprocessed sequence features reduce the robustness of the model; we will develop a sequence-data denoising algorithm to increase the separability between samples. Generally, subsequent research will emphasize excavating more robust characteristics with less noise and constructing reliable classifiers. The expansion of high-throughput databases will furnish a data foundation for building complementary identification tools.

Figure 1: The flow chart of SEBGLMA. (a) Complete the adjacency matrix by GIP kernel similarity. (b) Calculate semantic features by K-mer segmentation and word2vec. (c) Integrate semantic and behavior features by the graph convolution network. (d) Feed the fusion features into the rotation forest for predicting LMAs.

Figure 6: The accuracy surface of the optimization on K-value and L-value.
Figure 7: The ROC curves obtained by 5-fold cross-validation.
Figure 8: The PR curves obtained by 5-fold cross-validation.

Table 1 :
The statistics of the benchmark dataset.

Table 2 :
5-fold cross-validation results on benchmark data set obtained by SEBGLMA.
Table 4 gives the results of the ablation experiments. First, we employed the RNA2Vec models to extract the attribute features of samples and input them into the classifier as the baseline experiment. Then, GIP is utilized to extract the self-similarity between samples of the same kind, which is simply concatenated with the attribute features for predicting LMAs. Compared with the single attribute features, the model integrating RNA2Vec and GIP differentiates samples by splicing the semantic features and self-similarity features; its evaluation indicators increased by 13.77%, 14.52%, 14.39%, 13.46%, 11.48%, and 15.39%, respectively. After that, the adjacency matrix generated from the attribute features and the known lncRNA-miRNA associations was fed into the GCN to generate fusion features; the criteria of the model combining RNA2Vec and GCN improved by 17.17%, 17.24%, 17.72%, 16.62%, 34.35%, and 17.48%. Finally, after integrating RNA2Vec, GIP, and GCN, SEBGLMA makes full use of the relationships between nodes to embed attribute features into behavior features.
Finally, the unrelated positive samples and the test dataset are subtracted to obtain the final negative sample dataset. The same number of negative samples is drawn from the remaining 102791 (419 × 263 − 7143 − 263) pairs without validation. After sorting by prediction score, the top 30 predicted interactions are listed in Table 7. Within the results, there are 27 of the top 30

Table 4 :
5-fold cross-validation results on benchmark data set obtained by ablation experiments.

Table 5 :
Comparison between our model and state-of-the-art methods on the benchmark dataset. The bold value represents the highest AUC value in the comparison.