Similarity Network Fusion Based on Random Walk and Relative Entropy for Cancer Subtype Prediction of Multigenomic Data

Designing an integrated method to discover cancer subtypes and understand the heterogeneity of cancer from multiple genomic data is a crucial task. In recent years, several clustering algorithms have been proposed and applied to cancer subtype prediction. Among them, similarity network fusion (SNF) can integrate multiple types of genomic data to identify cancer subtypes, which improves the understanding of tumorigenesis. SNF uses a dense similarity matrix to capture the global information of the data, but the interconnection of samples from different categories introduces noise. Therefore, constructing a more robust dense similarity matrix is important for improving cancer subtype identification. In this paper, we propose similarity network fusion based on random walk and relative entropy (R²SNF) for cancer subtype prediction. Firstly, the random walk algorithm is used to capture the complex relationships between samples in each genomic data and to obtain the transition probability distribution of the samples in the network. If two samples belong to the same class, the transition probability between them is large; if they do not, it is small. In this way, the degree of correlation between samples is captured, which reduces the noise caused by the interconnection of samples from different categories. Secondly, relative entropy is used to calculate the difference between the transition probability distributions of samples in order to construct a better dense similarity matrix that contains structural similarity information between samples. Thirdly, we iteratively fuse the obtained dense similarity matrix with the KNN similarity matrix to construct the fused similarity matrix of all genomic data. Finally, spectral clustering groups the fused similarity matrix into multiple clusters, which indicate the cancer subtypes. Experiments on seven cancer omics datasets show that the R²SNF algorithm performs well in identifying cancer subtypes.


Introduction
With the rapid development of high-throughput technology, a large amount of genomic data has been generated, including gene expression data, DNA methylation data, and DNA copy number variation data. In particular, The Cancer Genome Atlas (TCGA) [1] database covers genome, transcriptome, and epigenome information of more than 1,100 patients from more than 34 cancer types. These data have brought unprecedented opportunities to cancer research, such as driver gene selection [2] and cancer subtype prediction, so that cancer can be controlled more thoroughly and comprehensively.
Various types of genomic data are closely related to the occurrence and development of cancer. In general, cell growth and differentiation are regulated by the gene expression level, and changes in the gene expression level can lead to the transformation of normal cells into cancer cells [3]. DNA single-nucleotide polymorphisms and copy number variations in the genome affect gene instability and cancer gene activation through gene amplification or cancer suppression [4]. DNA methylation in epigenetic variation is also common in cancer genomes. Genome-wide hypomethylation can lead to genome instability. The hypermethylation of CpG islands is also related to the inactivation of cancer suppressor genes [5]. At present, many studies have attempted to use these genomic data to predict cancer subtypes. However, the cancer genome is regulated by a variety of molecular mechanisms, whose complexity and independence make it difficult to discover the relationship between the cancer genome and the cancer phenotype. Therefore, integrating different genomic data to capture the complexity of phenotypes and the heterogeneity of biological processes [6,7] is the current trend in predicting cancer subtypes.
In the past few decades, many genomic data integration algorithms have been developed. For example, Shen et al. [8] proposed a joint latent variable model named iCluster, which combines the correlation between different types of genomic data and the variance-covariance structure within each data type to mine potential cancer subtypes. Akavia et al. [9] proposed an algorithm based on the Bayesian network to integrate the matched chromosome copy number and gene expression data of tumor samples to identify driver mutations and the processes they influence. Liang et al. [10] proposed a multimodal deep belief network algorithm, which encodes the relationship between the features of each genomic data as a multilayer network of hidden variables and then fuses common features to cluster cancer into different subtypes. Speicher and Pfeifer [11] added regularization constraints to the optimization process of multikernel learning to avoid overfitting and used one kernel for each genomic data type to solve the problem of kernel function and parameter selection. Wang et al. [12] proposed a multiplexed network, which integrates heterogeneous genomic data by using the links between each node in a network slice and its corresponding nodes in every other network slice. Van et al. [13] used sequencing matrix decomposition to represent genomic data and identify cancer subtypes based on mutations and gene expression characteristics. Zhang and Ma [14] proposed a regularized multiview subspace clustering method to integrate gene expression data with the protein interaction network of dynamic modules. The network-based stratification (NBS) [15,16] method combines genome-scale somatic mutation profiles with a gene interaction network, constructed from protein-protein interactions (PPI), to produce a robust subdivision of patients into subtypes. The simultaneous rank matrix factorization (SRF) [13] method approaches the subtyping problem by decomposing patient-mutation and patient-expression data into ranked factors.
Among these integrated algorithms, Wang et al. [6] proposed a very effective cancer subtype identification algorithm, similarity network fusion (SNF). SNF consists of three stages: network construction, network fusion, and clustering. In the network construction stage, the Euclidean distances within each omics data are used to construct a patient similarity network. In the network fusion stage, the information dissemination theory is used to perform nonlinear iterative fusion of the constructed networks. Finally, the spectral clustering algorithm is used for clustering. SNF integrates mRNA expression data, DNA methylation data, and miRNA expression data and establishes a cancer subtype prediction model on five cancer datasets.
At present, many studies have improved and extended SNF. Xu et al. [17] proposed a weighted similarity network fusion algorithm, which uses a complex miRNA-TF-mRNA regulatory network to identify cancer subtypes. To address the problem that SNF is only applicable to data types containing continuous values, Yang et al. [18] used the random walk method to smooth the discrete somatic mutation data and incorporated the smoothed data into the SNF algorithm so that SNF can fuse discrete data. Yang et al. [19] proposed a deep subspace fusion clustering algorithm, which uses self-encoding and data self-expression to guide the deep subspace model. The model can effectively express the discriminant similarity between samples, thereby increasing the separation between clusters and the compactness within clusters. Owing to its superior performance, SNF has become one of the most popular algorithms for cancer subtype identification. Therefore, this paper improves SNF from the perspective of similarity matrix construction, aiming to further improve the recognition effect of SNF on cancer subtypes.
After SNF completes network fusion, the fused network needs to be clustered through spectral clustering [20]. The essence of spectral clustering is to map the Laplacian matrix so that samples that are not easy to handle in the original space can be easily processed in the mapped space. The Laplacian matrix is calculated from the similarity matrix, so the construction of the similarity matrix is the key to SNF. SNF constructs two similarity matrices for each genomic data, a dense similarity matrix and a sparse similarity matrix, which are used to capture the global and local information of the genomic data, respectively. In SNF, the K-nearest neighbor (KNN) algorithm is used to construct the sparse similarity matrix. The KNN algorithm is the most commonly used and effective sparse similarity matrix construction method. All samples in the dense similarity matrix have connecting edges. In spectral clustering, the interconnection of samples of different categories in the dense similarity matrix causes noise interference and affects the segmentation effect of spectral clustering. Therefore, how to optimize the dense similarity matrix has become a major problem faced by SNF.
In this paper, we propose similarity network fusion based on random walk [17] and relative entropy (R²SNF) for cancer subtype prediction. Random walk and relative entropy are used to measure the similarity between samples in order to construct a more robust dense similarity matrix on each genomic data. The similarity matrix construction method based on random walk measures the transition probability of a sample walking along a randomly selected adjacent edge to reach other samples, thereby forming a transition probability distribution for this sample. To better measure the similarity between two samples, relative entropy is used to calculate the difference between their transition probability distributions, and the similarity between them is obtained from it: the greater the difference between the two probability distributions, the less similar the corresponding samples, and vice versa. The dense similarity matrix is thus constructed by running a random walk on the basis of the conventional dense similarity matrix. It uses the difference in the transition probability distribution between samples to measure the similarity of two samples, so that samples of the same class have a larger similarity value and samples of different classes have a smaller similarity value. Thus, a more robust dense similarity matrix is obtained. In R²SNF, we use this dense similarity matrix and the sparse similarity matrix obtained by the KNN algorithm to perform similarity network fusion across the different genomic data. Finally, we use spectral clustering to cluster the fused similarity matrix. Experimental results on multiple genomic data show that R²SNF can identify biologically significant cancer subtypes.

Methods
In this section, we introduce our algorithm, similarity network fusion based on random walk and relative entropy (R²SNF), in detail. Firstly, the probability distribution of a random walk from one sample in each genomic data to the other samples in the network is calculated. Secondly, relative entropy is used to calculate the difference between the probability distributions of two samples, and the robust dense similarity matrix is constructed. Thirdly, similarity network fusion between the constructed robust dense similarity matrix and the KNN similarity matrix is performed to obtain the fused similarity matrix. Finally, spectral clustering is used to cluster the fused similarity matrix.

Construction of the Random Walk Model.
Random walk [21] is a random process model that can simulate the interaction between samples in a network. A random walk on a graph can be regarded as a Markov chain over randomly selected nodes. After years of development, a variety of random walk algorithms have been proposed. Here, we use the random walk with restart (RWR) algorithm proposed by Tong et al. [22]. Given a set of cancer genomic data $X = \{X^1, X^2, \ldots, X^V\}$, $V$ represents the number of genomic data, $X^v$ is the $v$th genomic data in $X$ with $m_v$ features, and $n$ represents the number of samples. For each genomic data $X^v$, starting from the $i$th sample $x_i^v$, each step of the RWR faces two choices: move to an adjacent sample with probability $\alpha$ or return to the starting sample with probability $1 - \alpha$. The walk started at $x_i^v$ can thus reach any sample and eventually settles into a stable state. According to the Markov property, the current state of the system depends only on the state at the previous moment. Therefore, the state vector $r_{t+1}^v(x_i^v)$ at time $t + 1$ is determined by three quantities: $r_t^v(x_i^v)$, the state vector at time $t$; $r_0^v(x_i^v)$, the initialization vector with the $i$th element being 1 and the remaining elements being 0; and $A^v \in \mathbb{R}^{n \times n}$, the transition probability matrix of each genomic data.
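As a rough illustration of the restart walk described above, the sketch below iterates the standard RWR recursion $r_{t+1} = \alpha A^T r_t + (1 - \alpha) r_0$ until the state vector stops changing. It assumes a row-stochastic transition matrix (entry `A[i][j]` is the probability of stepping from sample i to sample j); the function name `rwr`, the default `alpha`, the tolerance, and the toy matrix are illustrative choices and are not taken from the paper.

```python
def rwr(A, start, alpha=0.8, tol=1e-10, max_iter=1000):
    """Random walk with restart from sample `start`.

    A is assumed row-stochastic: A[i][j] is the probability of stepping
    from sample i to sample j. With probability alpha the walker follows
    an edge; with probability 1 - alpha it jumps back to `start`.
    """
    n = len(A)
    r0 = [1.0 if k == start else 0.0 for k in range(n)]
    r = r0[:]
    for _ in range(max_iter):
        # r_next[j] = alpha * sum_i r[i] * A[i][j] + (1 - alpha) * r0[j]
        r_next = [
            alpha * sum(r[i] * A[i][j] for i in range(n)) + (1 - alpha) * r0[j]
            for j in range(n)
        ]
        if max(abs(a - b) for a, b in zip(r_next, r)) < tol:
            return r_next
        r = r_next
    return r


# Toy 3-sample transition matrix (hypothetical values): samples 0 and 1
# are strongly linked, sample 2 is only weakly reachable from them.
A_walk = [
    [0.0, 0.9, 0.1],
    [0.9, 0.0, 0.1],
    [0.5, 0.5, 0.0],
]
r_from_0 = rwr(A_walk, start=0)
```

The returned vector is a probability distribution over samples; in this toy case the walker spends most of its time on the starting sample and its close neighbor, which is the behavior the method relies on to separate classes.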
Under normal circumstances, the transition probability matrix of the random walk on the graph can be represented by the adjacency matrix after data normalization. We adopt the following ideas to construct the transition probability matrix $A^v$.
Firstly, we construct the similarity matrix $W^v \in \mathbb{R}^{n \times n}$ for each genomic data from the distances $\rho(x_i^v, x_j^v)$ between samples, where $W^v(i, j)$ represents the similarity between sample $x_i^v$ and sample $x_j^v$, $\mu$ is an empirical hyperparameter, and $\mathrm{mean}(\rho(x_i^v, N_i^v))$ denotes the average of the distances between the sample $x_i^v$ and its neighbors. In the process of random walk, $A^v$ is a transition probability matrix, which needs to meet a normalization condition; it is obtained by normalizing $W^v$ with the degree matrix $D^v$, whose diagonal elements are the row sums $D^v(i, i) = \sum_{j} W^v(i, j)$.
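The construction above can be sketched as follows. The exact kernel is an assumption: the code uses the scaled exponential kernel from the original SNF method, $W(i,j) = \exp(-\rho^2(x_i, x_j) / (\mu\,\varepsilon_{ij}))$ with $\varepsilon_{ij} = (\mathrm{mean}(\rho(x_i, N_i)) + \mathrm{mean}(\rho(x_j, N_j)) + \rho(x_i, x_j))/3$, Euclidean distance for $\rho$, and row normalization $A = D^{-1} W$. The function names, the values of `k` and `mu`, and the toy samples are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def similarity_matrix(X, k=2, mu=0.5):
    """Scaled exponential kernel in the style of SNF (assumed form)."""
    n = len(X)
    rho = [[euclidean(X[i], X[j]) for j in range(n)] for i in range(n)]
    # Mean distance from each sample to its k nearest neighbours (self excluded).
    mean_nn = []
    for i in range(n):
        nearest = sorted(rho[i][j] for j in range(n) if j != i)[:k]
        mean_nn.append(sum(nearest) / len(nearest))
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            eps = (mean_nn[i] + mean_nn[j] + rho[i][j]) / 3.0
            W[i][j] = math.exp(-rho[i][j] ** 2 / (mu * eps))
    return W


def transition_matrix(W):
    """Row-normalise W by its degrees: A = D^{-1} W, so every row sums to 1."""
    return [[w / sum(row) for w in row] for row in W]


# Four hypothetical samples with two features each, forming two tight pairs.
X_toy = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
W_toy = similarity_matrix(X_toy)
A_toy = transition_matrix(W_toy)
```

Dividing each row by its degree is what makes the rows of the result valid step probabilities for the walker; samples in the same tight pair end up with nearly all of each other's transition mass.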

Construction of the Similarity Matrix Based on Relative Entropy.
After calculating the stable state transition probability distribution $r^v$ from the RWR in Section 2.1, the similarity $S^v(x_i^v, x_j^v)$ between sample $x_i^v$ and sample $x_j^v$ is usually defined [23] in terms of the probability of starting from $x_i^v$ and arriving at $x_j^v$ via random walk. However, this method only considers the probability value of the random walk between the two samples and ignores the structural similarity between them.
To better measure the similarity between samples, the difference between the transition probability distributions of two nodes is used to define their structural similarity. We use relative entropy to construct the dense similarity matrix [24]. Relative entropy, also known as Kullback-Leibler (KL) divergence [25], describes the difference between two probability distributions. Here, relative entropy is used to calculate the difference between the transition probability distributions of different samples.
For sample $x_i^v$, the transition probability distribution $r^v(x_i^v)$ obtained once the random walk reaches a stable state is a vector with one entry for each of the $n$ samples. For the transition probability distributions $r^v(x_i^v)$ and $r^v(x_j^v)$ of any two samples $x_i^v$ and $x_j^v$, the relative entropy between them is computed. Relative entropy is an asymmetric measure; that is, the relative entropy of $r^v(x_i^v)$ with respect to $r^v(x_j^v)$ generally differs from that of $r^v(x_j^v)$ with respect to $r^v(x_i^v)$. Therefore, the probability distribution difference matrix is defined as $C^v$, in which $C^v(i, j)$ is the difference between the probability distributions of samples $x_i^v$ and $x_j^v$. Finally, $C^v$ is transformed into a similarity matrix $S^v$ with elements $S^v(i, j)$, using $C_{\max}^v$, the maximum in $C^v$. From equation (8), we can see the following: when the transition probability distributions of samples $x_i^v$ and $x_j^v$ differ greatly, that is, the value of $C^v(i, j)$ is very large, a smaller value of $S^v(i, j)$ is assigned. Conversely, when the difference between the transition probability distributions of samples $x_i^v$ and $x_j^v$ is small, that is, the value of $C^v(i, j)$ is small, a large value of $S^v(i, j)$ is assigned. Thus, the construction of the similarity matrix based on relative entropy is realized.
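A minimal sketch of this step is given below, under two labeled assumptions: the difference matrix symmetrizes the relative entropy by summing both directions, $C(i,j) = D(r_i \| r_j) + D(r_j \| r_i)$ with $D(p \| q) = \sum_k p_k \log(p_k / q_k)$, and the similarity is obtained as $S(i,j) = 1 - C(i,j)/C_{\max}$. Both forms are consistent with the description above but are not confirmed by it; the function names and the toy distributions are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Relative entropy D(p || q) = sum_k p[k] * log(p[k] / q[k]).

    eps guards against log(0) and division by zero; it is an
    implementation convenience, not part of the definition.
    """
    return sum(pk * math.log((pk + eps) / (qk + eps))
               for pk, qk in zip(p, q) if pk > 0)


def kl_similarity(R):
    """Turn per-sample distributions R[i] into a similarity matrix.

    Assumed construction: C[i][j] = D(R[i] || R[j]) + D(R[j] || R[i]),
    then S[i][j] = 1 - C[i][j] / max(C).
    """
    n = len(R)
    C = [[kl_divergence(R[i], R[j]) + kl_divergence(R[j], R[i])
          for j in range(n)] for i in range(n)]
    c_max = max(max(row) for row in C)
    if c_max == 0:
        return [[1.0] * n for _ in range(n)]  # all distributions identical
    return [[1.0 - C[i][j] / c_max for j in range(n)] for i in range(n)]


# Hypothetical stable-state distributions for three samples.
R_toy = [
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.1, 0.2, 0.7],
]
S_toy = kl_similarity(R_toy)
```

With this normalization the most dissimilar pair of samples receives similarity 0 and every sample has similarity 1 with itself, so the values can be used directly as edge weights of the dense network.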

Similarity Network Fusion Based on Random Walk and Relative Entropy (R²SNF).
Through the above two steps, the similarity matrix $S^v$ is obtained. In the similarity network fusion stage, we use $S^v$ as a dense similarity matrix to obtain the global structure between samples and use the KNN similarity matrix to capture the local structure.
For any sample $x_i^v$, KNN defines the similarity matrix $K^v$ between $x_i^v$ and its $k$ most similar samples; the element $K^v(i, j)$ is nonzero only when $x_j^v$ belongs to $N_i^v$, the set of neighbors of $x_i^v$. Assume that there is a total of $V$ genomic data to be integrated. In the same way as SNF, we perform nonlinear iterative fusion of the dense similarity matrix $S^v$ and the sparse similarity matrix $K^v$ of each dataset. According to the fusion update in equation (12), we obtain the similarity matrix $S^v$ of the cross-diffusion of the $v$th genomic data with the other data. Then, the final fused similarity matrix $S$ is obtained by averaging all $S^v$: $S = \frac{1}{V} \sum_{v=1}^{V} S^v$.
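The fusion stage can be sketched as below. The update rule is an assumption borrowed from the original SNF method: at each iteration, the dense matrix of view $v$ is replaced by $K^v \cdot \bar{S}^{(\neq v)} \cdot (K^v)^T$, where $\bar{S}^{(\neq v)}$ is the mean of the other views' current dense matrices, and the fused matrix is the average over views. The KNN kernel here keeps the $k$ largest off-diagonal entries of each row and rescales them to sum to 1. SNF's per-iteration renormalization is omitted for brevity, at least two views are required, and all names and toy matrices are hypothetical.

```python
def knn_kernel(W, k):
    """Sparse kernel: keep each row's k largest off-diagonal similarities
    and normalise them to sum to 1 (assumed SNF-style definition)."""
    n = len(W)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        neighbours = sorted((j for j in range(n) if j != i),
                            key=lambda j: W[i][j], reverse=True)[:k]
        total = sum(W[i][j] for j in neighbours)
        for j in neighbours:
            K[i][j] = W[i][j] / total
    return K


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def fuse(S_list, K_list, n_iter=20):
    """Cross-diffusion over V >= 2 views, then average the views."""
    V = len(S_list)
    n = len(S_list[0])
    S_cur = [[row[:] for row in S] for S in S_list]
    for _ in range(n_iter):
        S_next = []
        for v in range(V):
            others = [S_cur[u] for u in range(V) if u != v]
            mean_other = [[sum(S[i][j] for S in others) / len(others)
                           for j in range(n)] for i in range(n)]
            S_next.append(
                matmul(matmul(K_list[v], mean_other), transpose(K_list[v])))
        S_cur = S_next
    return [[sum(S[i][j] for S in S_cur) / V for j in range(n)]
            for i in range(n)]


# Two hypothetical dense similarity matrices for the same four samples.
S1_demo = [[1.0, 0.9, 0.1, 0.2], [0.9, 1.0, 0.2, 0.1],
           [0.1, 0.2, 1.0, 0.8], [0.2, 0.1, 0.8, 1.0]]
S2_demo = [[1.0, 0.7, 0.3, 0.1], [0.7, 1.0, 0.1, 0.2],
           [0.3, 0.1, 1.0, 0.9], [0.1, 0.2, 0.9, 1.0]]
K1_demo = knn_kernel(S1_demo, k=2)
K2_demo = knn_kernel(S2_demo, k=2)
F_demo = fuse([S1_demo, S2_demo], [K1_demo, K2_demo], n_iter=3)
```

Because each row of a KNN kernel sums to 1, every entry of the diffused matrix is a weighted average of entries from the other views, so information only flows along each view's strongest local edges.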

Spectral Clustering on the Fused Similarity Matrix.
Suppose we want to identify $c$ cancer subtypes from multiple genomic data; we then use spectral clustering to cluster the cancer samples into $c$ clusters. For the $i$th sample, we define a cluster indicator vector $y_i \in \{0, 1\}^c$. When the $i$th sample belongs to the $j$th cluster, $y_i(j) = 1$; otherwise, $y_i(j) = 0$. The cluster indicator matrix can be written as $Y = (y_1^T, y_2^T, \ldots, y_n^T)$. With the fused similarity matrix $S$, spectral clustering is performed by solving an optimization problem over $U = Y(Y^T Y)^{-1/2}$, $U \in \mathbb{R}^{n \times c}$, the scaled partition matrix. From the fused similarity matrix $S$, the normalized Laplacian matrix $L$ is defined as $L = I - D^{-1/2} S D^{-1/2}$, where $D$ is the degree matrix, $D = \mathrm{diag}(d_1, d_2, \ldots, d_n)$ with $d_i = \sum_{j=1}^{n} S(i, j)$. In this way, we can capture the global structure of the fused similarity matrix through spectral clustering.
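The Laplacian defined above can be computed as in the following sketch. The objective is assumed to be the standard spectral clustering relaxation, $\min_U \mathrm{Tr}(U^T L U)$ subject to $U^T U = I$, whose solution is given by the eigenvectors of $L$ for the $c$ smallest eigenvalues; the eigendecomposition and the final k-means step on the rows of $U$ are left to a numerical library and are not shown. The function name and the toy matrix are hypothetical.

```python
import math


def normalized_laplacian(S):
    """L = I - D^{-1/2} S D^{-1/2}, with D = diag(row sums of S)."""
    n = len(S)
    d = [sum(row) for row in S]
    return [[(1.0 if i == j else 0.0) - S[i][j] / math.sqrt(d[i] * d[j])
             for j in range(n)] for i in range(n)]


# Hypothetical fused similarity matrix for three samples.
S_fused = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
L_demo = normalized_laplacian(S_fused)

# Sanity check: D^{1/2} * 1 is an eigenvector of L with eigenvalue 0.
sqrt_d = [math.sqrt(sum(row)) for row in S_fused]
residual = [sum(L_demo[i][j] * sqrt_d[j] for j in range(3)) for i in range(3)]
```

The zero eigenvalue checked at the end is a general property of the normalized Laplacian of a symmetric similarity matrix with positive degrees, which makes it a convenient test that the matrix was assembled correctly.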

Datasets and Survival Analysis.
In this paper, we tested the proposed algorithm on three types of genomic data, that is, mRNA expression data, miRNA expression data, and DNA methylation data. The cancer types we tested include glioblastoma multiforme (GBM), breast invasive carcinoma (BIC), kidney renal clear cell carcinoma (KRCCC), lung squamous cell carcinoma (LSCC), and colon adenocarcinoma (COAD). The above data can be downloaded from the TCGA website [5]. In addition, we also conducted experiments on the BREAST cancer and LUNG cancer datasets in [26]. The detailed information of the cancer multigenomic datasets is shown in Table 1.
This paper conducts survival analysis based on the cancer subtypes obtained by clustering to verify the survival differences among samples of the different cancer subtypes found by the proposed algorithm. In statistics, hypothesis testing is usually used to quantify whether survival curves differ. Here, the Cox log-rank test [27] is used to calculate the p value. The Cox log-rank test is a nonparametric hypothesis test that is often used to assess the significance of differences in survival between subtypes. The p value is the probability of observing a survival difference at least as large as the one measured if the subtypes actually had the same survival. Therefore, the smaller the p value, the more significant the survival difference between the identified subtypes. In addition, the Kaplan-Meier estimation method [28] is usually used to estimate the survival function and to obtain the Kaplan-Meier survival curve. The x-axis of the survival curve is the time from the beginning of observation to the last observation time point. The y-axis is the survival rate of the samples. The curve represents the development of the event over time.
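For reference, the Kaplan-Meier estimate mentioned above can be computed as in this short sketch: at each distinct event time $t$, the running survival probability is multiplied by $1 - d_t / n_t$, where $d_t$ is the number of events at $t$ and $n_t$ the number of patients still at risk. The function name and the follow-up data are hypothetical and unrelated to the datasets in Table 1.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  -- follow-up time of each patient
    events -- 1 if the event (death) was observed, 0 if censored
    """
    curve = []
    survival = 1.0
    for t in sorted(set(t for t, e in zip(times, events) if e == 1)):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical follow-up data: five patients, one censored at t = 3.
km_demo = kaplan_meier([1, 2, 3, 4, 5], [1, 1, 0, 1, 1])
```

The censored patient contributes to the at-risk counts up to t = 3 but never triggers a drop in the curve, which is how the estimator uses incomplete follow-up without discarding it.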

Experimental Results.
We compared the proposed algorithm R²SNF with several cancer subtype prediction methods, namely, SNF [6], LRAcluster [29], iClusterPlus [30], pattern fusion analysis (PFA) [31], affinity network fusion (ANF) [32], and multiview clustering based on Stiefel manifold (MCSM) [33], to verify its effectiveness. To verify whether the relative entropy in the R²SNF algorithm improves the prediction of cancer subtypes, we also remove the relative entropy from R²SNF and use equation (5) to construct the similarity matrix. We name this variant similarity network fusion based on random walk (RSNF). A brief introduction to these methods is as follows:
(i) SNF first uses the exponential similarity kernel method to define the similarity between the sample points of each genomic data. It defines a dense similarity matrix and, using the KNN method, a sparse similarity matrix. Then, an information transfer model is used to fuse the two similarity matrices, and the fused similarity matrix is obtained by iterative updating. Finally, spectral clustering is used to cluster the fused similarity matrix.
(ii) LRAcluster is a dimensionality reduction and clustering method for multigenomic data based on low-rank approximation. It can deal with data types that follow a variety of distributions and guarantees the orthogonality of the low-dimensional space. It is suitable for clustering analysis of large-scale multigenomic data and has been widely studied and applied.
(iii) iClusterPlus considers that different variable types should follow different linear probability relationships. It then builds a joint sparse model to complete the tasks of sample clustering and feature selection.
(iv) PFA uses the local information extraction method to project each genomic data into a low-dimensional space and builds a dynamic collimation method based on the idea of manifold learning. Then, it integrates the low-dimensional spatial information into a feature space containing information from the different genomic data. Finally, the K-means method is used to cluster the samples.
(v) ANF first constructs a patient affinity network from each omics data and then fuses all individual networks to obtain a more robust one. To make the patient affinity network robust to noise, ANF mainly employs two nonlinear k-nearest-neighbor (kNN) based transformations: the kNN Gaussian kernel and the kNN graph.
(vi) MCSM establishes a binary optimization model for the simultaneous clustering problem. Then, the optimization problem is solved by a linear search algorithm based on the Stiefel manifold. Finally, it integrates the clustering results obtained from the multiomics data by using the k-nearest neighbor method.
(vii) RSNF obtains the probability of starting from one sample and arriving at another via random walk, calculates the similarity matrix from the random walk probability between the two samples, and finally performs similarity network fusion according to SNF.
Since R²SNF is an improved version of SNF, to make a more direct comparison we used the numbers of clusters suggested in SNF; that is, GBM is clustered into 3 categories, BIC into 5 categories, KRCCC into 3 categories, LSCC into 4 categories, and COAD into 3 categories. For the BREAST and LUNG datasets, we also used the cancer subtype determination method in SNF to determine the number of their cancer subtypes as 3 and 2, respectively. The specific experimental results of R²SNF and the other methods on the seven cancer multigenomic datasets are shown in Table 2. Compared with RSNF, R²SNF had better results on six datasets, the exception being KRCCC. This shows that using relative entropy to calculate the probability distribution difference between samples is beneficial to the construction of the similarity matrix. Compared with SNF, R²SNF has smaller p values on all datasets except COAD. The results of RSNF on GBM, BIC, KRCCC, and LSCC are better than those of SNF, especially on KRCCC and LSCC, but slightly worse than those of SNF on the other data, which indicates that using only the probability obtained by random walk between samples to construct the similarity matrix also has a certain effect on cancer subtype identification. Compared with the other algorithms, R²SNF has the best results overall. Only on BIC data is the MCSM algorithm better than R²SNF. Figure 1 shows the Kaplan-Meier survival curves of the cancer subtypes identified by R²SNF on the seven cancer datasets. In Figure 1(b), R²SNF is not very effective when BIC is divided into 5 subtypes, but it can clearly divide it into 3 subtypes. Moreover, the p value of SNF on the BIC data is lower than the p value of SNF. Therefore, we recommend that BIC should be divided into 3 subtypes. The number of clusters given for the BREAST dataset in [26] is 3, which can be seen in Figure 1(f). This further supports our conclusion.

Analysis on the GBM Dataset.
Glioblastoma multiforme (GBM) is the most common and lethal malignant primary brain tumor in adults and is one of a group of tumors known as gliomas. Many studies have investigated GBM at the molecular level, and some have given definite cancer subtypes and corresponding treatment plans for clinical use. For example, based on mRNA expression data, Verhaak et al. [34] divided GBM into four cancer subtypes: mesenchymal, classical, neural, and proneural. In [35], according to the difference in the CpG island methylator phenotype (CIMP), GBM was divided into two cancer subtypes: G-CIMP and non-G-CIMP.
On GBM data, we counted how the clusters obtained by R²SNF are distributed over the cancer subtypes determined in the above two studies and summarized the results in Table 3. Table 3 shows that there are more patients in subtype 1 than in subtype 3. Most patients in subtype 1 are grouped into non-G-CIMP (99.3%), and they are spread over the four subtypes in [34]. Subtypes 2 and 1 have similar distributions. It is worth noting that most of the 19 patients in subtype 3 are of the G-CIMP subtype (73.7%), and all of them are of the proneural subtype.
To further analyze the cancer subtypes obtained by R²SNF, the clinical data for all GBM patients were downloaded from the cBio Cancer Genomics Portal database. We drew a boxplot of the age distribution of patients in the three cancer subtypes (Figure 2). Figure 2 shows that the cancer subtypes identified by R²SNF have a clear difference in age distribution. Combining Figures 1 and 2, we find that the patients in subtype 3, which has the best survival advantage in Figure 1, are also younger than the patients in subtypes 1 and 2.
Furthermore, we drew Kaplan-Meier survival curves of GBM patients' response to the drug temozolomide (TMZ) in Figure 3. The patients within each of the three cancer subtypes were divided into two groups: patients treated with TMZ and patients not treated with TMZ. TMZ is a drug that is commonly used to treat GBM, but only a subset of patients responds well to it. The p values of the survival analysis in the Cox log-rank model for the three cancer subtypes are $5.42 \times 10^{-6}$, $3.78 \times 10^{-4}$, and 0.36, respectively, which indicates that TMZ has no significant effect on the survival of patients in cancer subtype 3.
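A two-group comparison of this kind (treated versus untreated) can be sketched with the textbook log-rank statistic: at each event time, the observed number of events in one group is compared with the number expected if both groups shared the same hazard, and the squared total deviation divided by its variance is referred to a chi-square distribution with one degree of freedom. The implementation below is an illustrative sketch with hypothetical data, not the analysis pipeline used for Figure 3.

```python
import math


def logrank_two_groups(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p value)."""
    pooled = ([(t, e, 0) for t, e in zip(times_a, events_a)]
              + [(t, e, 1) for t, e in zip(times_b, events_b)])
    observed_a = expected_a = variance = 0.0
    for t in sorted(set(t for t, e, _ in pooled if e == 1)):
        n_a = sum(1 for u, _, g in pooled if u >= t and g == 0)
        n_b = sum(1 for u, _, g in pooled if u >= t and g == 1)
        d_a = sum(1 for u, e, g in pooled if u == t and e == 1 and g == 0)
        d_b = sum(1 for u, e, g in pooled if u == t and e == 1 and g == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            # Hypergeometric variance of the event count in group a.
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    stat = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical groups: identical follow-up, then clearly separated follow-up.
same_stat, same_p = logrank_two_groups([1, 2, 3], [1, 1, 1], [1, 2, 3], [1, 1, 1])
diff_stat, diff_p = logrank_two_groups([1, 2, 3, 4], [1, 1, 1, 1],
                                       [10, 11, 12, 13], [1, 1, 1, 1])
```

A large p value, as reported for subtype 3, means the observed gap between the two curves is compatible with chance; it does not by itself establish that the treatment is ineffective.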
In summary, subtype 3 of GBM identified by R²SNF has the following characteristics. First, most of the patients in subtype 3 are of the G-CIMP subtype, and all of them are of the proneural subtype. Second, the patients in subtype 3, which has the best survival advantage, are also younger than the patients in subtypes 1 and 2. Third, TMZ has no significant effect on the survival of patients in cancer subtype 3. Therefore, we believe that subtype 3 identified by R²SNF is a biologically significant cancer subtype. In addition, it can be inferred that we obtain a potential cancer subtype that contains patients belonging to both G-CIMP and proneural. This is consistent with the study reported by Brennan et al. that the proneural subtype granted by the G-CIMP phenotype has unique properties [36].

Analysis on the BREAST Dataset.
Breast cancer refers to a malignant tumor in which cancer cells have penetrated the basement membrane of breast ducts or lobular alveoli and invaded the interstitium. Many scholars have carried out studies and analyses at the gene level and have given specific subtypes and treatment programs. Based on the microarray predictive analysis model, Parker et al. proposed a 50-gene classifier (known as PAM50) to classify BIC into five subtypes: basal-like, luminal A, luminal B, HER2-enriched, and normal-like [37]. On BREAST data, we counted how the clusters obtained by R²SNF are distributed over the cancer subtypes basal-like, luminal A, luminal B, and HER2-enriched in Table 4. Table 4 shows that subtype 1 is mainly distributed in luminal A and luminal B (80.6%), subtype 2 is mainly distributed in basal-like (74.6%), and subtype 3 is mainly distributed in luminal A and luminal B (70.8%). In addition, HER2-enriched is mainly distributed in subtypes 1 and 2 (89.1%), and normal-like is mainly distributed in subtype 1 (78.3%). We also chose two clinical labels for which we tested enrichment: Pathologic M and Pathologic N. Pathologic M and Pathologic N are the distant metastasis stage (M) and the regional lymph node stage (N) of breast cancer, respectively. Pathologic M includes three stages: M0, M1, and MX. Pathologic N roughly includes five stages: N0, N1, N2, N3, and NX. Generally, the numbers or letters after N and M provide more details about these factors, and the higher the number, the more severe the cancer.
We used the chi-square test to verify whether the identified subtypes differ significantly with respect to these clinical labels. The p values on Pathologic M and Pathologic N are $6 \times 10^{-3}$ and $9 \times 10^{-3}$, respectively. The detailed distributions of the subtypes obtained by R²SNF on Pathologic M and Pathologic N are shown in Tables 5 and 6, respectively. In Table 5, subtype 1, subtype 2, and subtype 3 have a similar distribution: they are mainly distributed in M0. The proportion of samples belonging to the M0 stage in the three subtypes is 74.9%. In Table 6, subtype 1, subtype 2, and subtype 3 also have a similar distribution: they are mainly distributed in N0 and N1. The proportions of samples belonging to the N0 stage and the N1 stage in the three subtypes are 46.3% and 33.8%, respectively.
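For readers unfamiliar with the test, the Pearson chi-square statistic for a subtype-by-stage contingency table can be computed as in the sketch below: each expected count is (row total × column total) / grand total, and the statistic sums (observed − expected)² / expected over all cells, with (rows − 1)(columns − 1) degrees of freedom. The counts are hypothetical and are not those of Tables 5 and 6; converting the statistic to a p value requires the chi-square distribution and is left to a statistics library.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a
    contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof


# Hypothetical counts: rows are subtypes, columns are stage labels.
stat_demo, dof_demo = chi_square_statistic([[10, 20], [20, 10]])
```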
From the above analysis, we can draw the following conclusions. First, subtypes 1 and 3 are mainly distributed in luminal A and luminal B, which are the breast cancer subtypes with the best prognosis. Second, subtype 2 is mainly distributed in basal-like, for which the clinical prognosis is poor. Third, the patients in the BREAST data are mainly in the early stages of breast cancer and have a high survival rate. All these conclusions can also be verified in Figure 1(f).

Figure 3: The Kaplan-Meier survival curves for TMZ response within the cancer subtypes identified by R²SNF: (a) subtype 1, (b) subtype 2, and (c) subtype 3. "Untreated" represents the patients who did not receive TMZ treatment, and "Treated" represents the patients who received TMZ treatment.

Conclusions
How to construct a robust dense similarity matrix is a key issue in SNF. In this paper, we analyzed the problems in the construction of the dense similarity matrix in SNF and proposed the similarity network fusion based on random walk and relative entropy (R²SNF) method for cancer subtype prediction. We used the random walk with restart algorithm to characterize the complex relationship between genomic data samples and obtained the stable state transition probability distribution of each sample. We then used relative entropy to calculate the difference in the transition probability distribution between samples to construct a better dense similarity matrix that contains structural similarity information between samples. Then, the constructed dense similarity matrix and the KNN similarity matrix were nonlinearly and iteratively fused. Finally, spectral clustering was used to cluster the fused similarity matrix. On seven cancer genomic datasets (GBM, BIC, KRCCC, LSCC, COAD, BREAST, and LUNG) containing three data types (mRNA expression data, miRNA expression data, and DNA methylation data), R²SNF was compared with a variety of classical cancer subtype prediction algorithms. Experimental results show that R²SNF performs better in identifying cancer subtypes than the comparison algorithms. The analysis of the GBM and BREAST results further indicates that R²SNF can discover cancer subtypes with biological significance. In addition to relative entropy, there are other methods to measure the difference between two probability distributions, such as Jensen-Shannon divergence, Wasserstein distance, and cross-entropy. In future work, we will look for a more suitable method to calculate the difference between probability distributions and thereby obtain a similarity matrix that is more conducive to cancer subtype prediction.

Data Availability
The data used to support the findings of this study are available from the first author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest in this work.