Joint Nonnegative Matrix Factorization Based on Sparse and Graph Laplacian Regularization for Clustering and Co-Differential Expression Genes Analysis



Introduction
With the development of state-of-the-art sequencing technology, a large quantity of effective experimental data has been collected. These data may imply unknown molecular mechanisms. Bioinformatics therefore faces the task of analyzing massive omics data.
The Cancer Genome Atlas (TCGA, https://tcgadata.nci.nih.gov/tcga/) includes gene expression profile data (GE), DNA methylation data (DM), copy number variation data (CNV), protein expression data, and drug sensitivity data. These data come from approximately 15,000 clinical samples of more than 30 kinds of cancers [1]. These massive data enable researchers to study the mechanisms of cancer development, diagnosis, and treatment at different biological levels.
The joint analysis of multiomics data can make up for lost or unreliable information in single-omics data. In recent years, scientists have performed considerable research on cancer mechanisms based on the joint analysis of cancer multiomics data. For example, Christina et al. integrated the gene expression data and copy number variations of breast cancer, identified possible pathogenic genes, and discovered new subtypes of breast cancer [2]. Wang and Wang used similarity network fusion to jointly analyze mRNA, DM, and microRNA (miRNA) data and further identify cancer subtypes [3]. Among existing joint analysis methods, those based on matrix decomposition are remarkable. Liu et al. integrated mRNA, somatic mutation, DNA methylation, and copy number variation data.
They established a block constraint-based RPCA model to identify differentially expressed genes (DEGs) [4]. Integration and analysis of these heterogeneous multiomics data provide an in-depth understanding of the pathogenesis of cancer and promote the development of precision medicine. Recently, unsupervised integrative methods based on matrix decomposition have attracted considerable attention among the existing methods for integrating and analyzing multiomics data. Zhang et al. constructed a joint matrix factorization framework (jNMF) to discover multidimensional modules of genomic data [5]. Yang and Michailidis introduced a new method named integrative NMF (iNMF) for heterogeneous multiomics data [6]. Strazar et al. incorporated orthogonality regularization into iNMF (iONMF) to integrate and analyze multiple data sources [7]. Joint nonnegative matrix factorization meta-analysis (jNMFMA) [8], multiomics factor analysis (MOFA) [9], and Bayesian joint analysis [10] have been successfully applied to the integration and analysis of cancer omics data. To avoid the influence of redundant information, many sparse modeling methods have been proposed. Typical applications are as follows: the weighted sparse representation classifier (WSRC) model combined with global encoding (GE) [11] was used to predict interactions between proteins based on protein sequence information.
The network-regularized sparse logistic regression model (NSLR) [12] was used to predict survival risk and discover biomarkers, and sparse coregularized matrix decomposition was used to find mutated driver genes [13].
In recent years, graph/network-based analysis, as a powerful data representation tool, has been applied to the modeling and analysis of complex systems [14][15][16][17]. In general, entities can be regarded as nodes, and the interactions between entities can be regarded as edges in the graph. Graph-based approaches can explore the local subspace structure and obtain low-dimensional representations of high-dimensional data. Zhang and Ma proposed a graph-based subspace clustering algorithm to detect common modules highly correlated with cancer by jointly analyzing gene expression and protein interaction networks [18]. Mixed-norm Laplacian regularized low-rank representation (MLLRR) was used to cluster samples [19]. Cui proposed an improved graph-based method to predict drug-target interactions [20]. Liu et al. reviewed the contributions of deep neural networks, deep graph embedding, and graph neural networks along with the opportunities and challenges they face [21]. Wu et al. proposed a multigraph learning algorithm called gMGFL that searches for and selects a group of discriminative subgraphs as features with which graph bags are transformed into instances and bag labels are transferred to the instances [22].
Recently, sparse regularization has played a very important role in data analysis. The L0-norm, L1-norm, L2,1-norm, etc. are all typical sparse regularization methods. Among these many sparse constraints, L2,1-norm regularization stands out in terms of computational time and performance. The L2,1-norm can obtain a projection matrix that is sparse in rows, so as to learn discriminative features in the subspace. Zhang used the L2,1-norm constraint on the coefficients to ensure that they are sparse in rows [23]. The L2,1-norm was applied to the predictor to ensure that it is robust to noise and outliers [24].
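To make the row-wise behavior of the L2,1-norm concrete, the following sketch (a synthetic matrix, NumPy assumed) computes it as the sum of the Euclidean norms of the rows, which is exactly why penalizing it drives entire rows toward zero:

```python
import numpy as np

def l21_norm(Z):
    """L2,1-norm: sum of the Euclidean norms of the rows of Z."""
    return np.sqrt((Z ** 2).sum(axis=1)).sum()

Z = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [5.0, 12.0]])
# Row norms are 5, 0, and 13, so the L2,1-norm is 18.
print(l21_norm(Z))  # 18.0
```

A zero row contributes nothing to the penalty, so minimizing it favors solutions in which whole rows (features) are switched off.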
Considering the roles of graph regularization and L2,1-norm constraints in matrix factorization, we propose joint nonnegative matrix factorization based on sparse and graph Laplacian regularization (SG-jNMF). SG-jNMF can make the best of the potential associations and complementary information among multiomics data. The main highlights of this approach are as follows.
(1) Graph regularization is incorporated into the joint nonnegative matrix factorization model, and undirected graphs are constructed for the input data. Local graph regularization can preserve the local geometrical structure of the data space. Therefore, SG-jNMF can use the low-dimensional characteristics of the observed data to find intrinsic laws and improve the performance of the integrated analysis method.
(2) L2,1-norm regularization treats each row of the matrix as a whole and enhances sparsity among the rows. Therefore, introducing the L2,1-norm can remove redundant features and noise in the data and further expose a clear cluster structure.
(3) Two forms of SG-jNMF are proposed. SG-jNMF1 projects multiomics data into a fusion feature space; the fusion matrix contains the complementary and differential information provided by multiomics data, so more accurate results can be obtained when identifying Co-DEGs. SG-jNMF2 projects multiomics data into a common sample space, which yields more accurate clustering results.
The rest of this paper is arranged as follows: In Section 2, we start with a brief review of jNMF. Next, we introduce the SG-jNMF method, its optimization process, and a computational complexity analysis. Section 3 presents the experimental results of clustering and feature selection. Finally, we summarize the whole paper and give some suggestions for future work in Section 4.

Joint Nonnegative Matrix Factorization.
The jNMF method was first proposed by Zhang et al. [5]. It projects multiple input data matrices into a common subspace so that the information in all inputs can be analyzed jointly. Each type of genomic data is denoted as $X_I \in \mathbb{R}^{M \times N}$ ($I = 1, 2, \dots, P$). $W \in \mathbb{R}^{M \times K}$ is the common basis matrix, and $H_I \in \mathbb{R}^{K \times N}$ is the corresponding coefficient matrix. The objective function of jNMF can be written as

$$\min_{W,\, H_I} \sum_{I=1}^{P} \| X_I - W H_I \|_F^2, \quad (1)$$
$$\text{s.t. } W \ge 0,\ H_I \ge 0,\ I = 1, 2, \dots, P. \quad (2)$$

Obviously, jNMF is the same as NMF when $P = 1$. Therefore, jNMF is a generalization of NMF to multiple input datasets. Similar to NMF, multiplicative update rules are used to minimize the objective function. $W$ and $H_I$ are iteratively updated according to the following rules:

$$W \leftarrow W \odot \frac{\sum_{I=1}^{P} X_I H_I^T}{W \sum_{I=1}^{P} H_I H_I^T}, \quad (3)$$
$$H_I \leftarrow H_I \odot \frac{W^T X_I}{W^T W H_I}, \quad (4)$$

where $\odot$ and the fraction bar denote element-wise multiplication and division, respectively. The jNMF method can be used to integrate and analyze multiomics data. It decomposes the multiomics data matrices into multiple independent coefficient matrices and a common fusion matrix simultaneously and projects high-dimensional omics data into low-dimensional spaces. Therefore, the abundant differential and complementary information of cancer multiomics data can be used efficiently, and the multiomics datasets are analyzed simultaneously to obtain hidden information with biological significance.
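The multiplicative updates above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic random data (the matrix sizes and the small `eps` added to denominators for numerical safety are our assumptions, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, K, P = 30, 20, 4, 3
eps = 1e-10

# Synthetic nonnegative "omics" matrices sharing a common basis.
X = [rng.random((M, N)) for _ in range(P)]
W = rng.random((M, K))
H = [rng.random((K, N)) for _ in range(P)]

def jnmf_step(X, W, H):
    # Update each coefficient matrix H_I with W fixed.
    H = [h * (W.T @ x) / (W.T @ W @ h + eps) for x, h in zip(X, H)]
    # Update the common basis W using all data matrices.
    num = sum(x @ h.T for x, h in zip(X, H))
    den = W @ sum(h @ h.T for h in H) + eps
    return W * num / den, H

def loss(X, W, H):
    return sum(np.linalg.norm(x - W @ h, 'fro') ** 2 for x, h in zip(X, H))

before = loss(X, W, H)
for _ in range(50):
    W, H = jnmf_step(X, W, H)
after = loss(X, W, H)
print(after < before)  # the objective decreases from a random start
```

The update keeps every factor nonnegative because it only ever multiplies nonnegative entries by nonnegative ratios.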

Joint Nonnegative Matrix Factorization Based on Sparse and Graph Laplacian Regularization.
Manifold learning has become a popular research topic in the domain of information science since it was first proposed in Science in 2000 [25,26]. Assuming that the data are uniformly sampled from a low-dimensional manifold embedded in a high-dimensional space, manifold learning can find the low-dimensional structure and obtain the corresponding embedding mapping. Manifold learning looks for the essence of things behind observed phenomena and finds the internal laws of the data. The manifold assumption states that data points that are geometrically adjacent usually have similar characteristics. Therefore, an undirected graph $G$ with weight matrix $U$ can be constructed to encode the adjacency among data points. The graph regularization with $G$ is as follows:

$$R = \frac{1}{2} \sum_{i,j} \| h_i - h_j \|^2 U_{ij} = \mathrm{Tr}(H L H^T),$$

where $\mathrm{Tr}(\cdot)$ is the trace of a matrix, $L$ is the graph Laplacian matrix, and $L = D - U$. $D$ is a diagonal matrix with $D_{ii} = \sum_j U_{ij}$. Intuitively, the smaller the value of $R$ is, the closer adjacent data points remain in the low-dimensional representation. By minimizing $R$, we can obtain a sufficiently smooth mapping function on the data manifold.
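The trace identity above is easy to verify numerically. The sketch below builds a small random symmetric weight matrix (a stand-in for a real neighbor graph) and checks that $\mathrm{Tr}(H L H^T)$ equals half the weighted sum of pairwise squared distances:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 8, 3
H = rng.random((K, N))          # columns h_j are low-dimensional points

# Random symmetric 0/1 weight matrix U of an undirected graph (illustrative).
U = np.triu((rng.random((N, N)) < 0.3).astype(float), 1)
U = U + U.T

D = np.diag(U.sum(axis=1))      # degree matrix, D_ii = sum_j U_ij
L = D - U                       # graph Laplacian

# Tr(H L H^T) == (1/2) * sum_{i,j} U_ij * ||h_i - h_j||^2
lhs = np.trace(H @ L @ H.T)
rhs = 0.5 * sum(U[i, j] * np.sum((H[:, i] - H[:, j]) ** 2)
                for i in range(N) for j in range(N))
print(np.isclose(lhs, rhs))  # True
```

Expanding $\|h_i - h_j\|^2$ and collecting terms with $D$ and $U$ gives exactly the two traces whose difference is $\mathrm{Tr}(H L H^T)$.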
To decrease the influence of noise and outliers in real data, sparse regularization is usually used to penalize the coefficient matrix. The L0-norm, L1-norm, and L2,1-norm are all typical sparse regularization methods. Minimizing the L0-norm is an NP-hard problem. The L1-norm is widely used because it has better optimization properties than the L0-norm: it tends to retain a small number of features while driving the others to 0, so it can be used for feature selection. However, L1-norm regularization is usually time-consuming. L2,1-norm regularization on the coefficient matrix generates a row-sparse result, and the calculation of the L2,1-norm is simple and convenient [23]. In this article, the L2,1 penalty is incorporated into SG-jNMF [27]. The L2,1-norm of a matrix $Z \in \mathbb{R}^{m \times n}$ is defined as

$$\| Z \|_{2,1} = \sum_{i=1}^{m} \sqrt{\sum_{j=1}^{n} Z_{ij}^2}.$$

2.2.1. SG-jNMF1.
There are two forms of the SG-jNMF method in this article. As shown in Figure 1, the SG-jNMF1 method projects multiomics data into a common feature space. Graph regularization and a sparse penalty are applied to the fusion feature matrix. The feature matrix is constrained by graph regularization, which preserves as much of the intrinsic geometric information of the original multiomics data as possible, while the L2,1-norm constraint on the feature matrix reduces the influence of outliers and noise. The resulting objective function for integrative nonnegative matrix factorization can be expressed as

$$\min_{W \ge 0,\, H_I \ge 0} \sum_{I=1}^{P} \| X_I - W H_I \|_F^2 + \sum_{I=1}^{P} \lambda_I \mathrm{Tr}(W^T L_{I1} W) + \beta \| W \|_{2,1},$$

where $L_{I1}$ is the Laplacian matrix, $L_{I1} = D_{I1} - U_{I1}$. $U_{I1}$ is a symmetric matrix, namely, the weight matrix constructed in graph regularization. $D_{I1}$ is a diagonal matrix, and its diagonal elements equal the corresponding row sums (equivalently, the column sums) of $U_{I1}$; i.e., $D_{I1,ii} = \sum_{j=1}^{n} U_{I1,ij}$.
With randomly initialized positive matrices $W$ and $H_I$, the following update rules are executed until the algorithm converges:

$$W \leftarrow W \odot \frac{\sum_{I} X_I H_I^T + \sum_{I} \lambda_I U_{I1} W}{W \sum_{I} H_I H_I^T + \sum_{I} \lambda_I D_{I1} W + \beta Q W},$$
$$H_I \leftarrow H_I \odot \frac{W^T X_I}{W^T W H_I},$$

where $Q$ is a diagonal matrix with diagonal elements

$$Q_{ii} = \frac{1}{2 \sqrt{\sum_{j} W_{ij}^2 + \varepsilon}},$$

and $\varepsilon$ is an infinitesimal positive number.
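A NumPy sketch of an SG-jNMF1-style iteration is given below. It is illustrative only: the data, the random symmetric feature graphs, and the exact placement of the graph and L2,1 terms in the numerator/denominator are our assumptions about how such multiplicative updates are typically arranged, with `eps` added for numerical safety:

```python
import numpy as np

rng = np.random.default_rng(2)
M, N, K, P = 25, 15, 4, 2
lam, beta, eps = 0.1, 0.1, 1e-10

X = [rng.random((M, N)) for _ in range(P)]
W = rng.random((M, K))
H = [rng.random((K, N)) for _ in range(P)]

# Feature-side neighbor graphs U_I (M x M); random symmetric here.
U = []
for _ in range(P):
    u = np.triu((rng.random((M, M)) < 0.2).astype(float), 1)
    U.append(u + u.T)
D = [np.diag(u.sum(axis=1)) for u in U]

for _ in range(30):
    # Q: diagonal matrix from the L2,1 subgradient on the rows of W.
    Q = np.diag(1.0 / (2.0 * np.sqrt((W ** 2).sum(axis=1) + eps)))
    num = sum(x @ h.T for x, h in zip(X, H)) + lam * sum(u @ W for u in U)
    den = (W @ sum(h @ h.T for h in H)
           + lam * sum(d @ W for d in D) + beta * Q @ W + eps)
    W = W * num / den
    H = [h * (W.T @ x) / (W.T @ W @ h + eps) for x, h in zip(X, H)]

print(W.min() >= 0)  # multiplicative updates keep W nonnegative
```

Because the graph term is split with $U$ in the numerator and $D$ in the denominator, all quantities stay nonnegative throughout.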

SG-jNMF2.
As seen from Figure 1, the SG-jNMF2 method projects multiomics data into a common sample space, and the constraints are enforced on the common sample matrix $H$. This method can be used to cluster multiomics data. The model can be written as

$$\min_{W_I \ge 0,\, H \ge 0} \sum_{I=1}^{P} \| X_I - W_I H \|_F^2 + \sum_{I=1}^{P} \lambda_I \mathrm{Tr}(H L_{I2} H^T) + \beta \sum_{I=1}^{P} \| W_I^T \|_{2,1}.$$

Similarly, the algorithm iterates until it converges according to the following rules:

$$W_I \leftarrow W_I \odot \frac{X_I H^T}{W_I H H^T + \beta W_I B_I},$$
$$H \leftarrow H \odot \frac{\sum_{I} W_I^T X_I + \sum_{I} \lambda_I H U_{I2}}{\sum_{I} W_I^T W_I H + \sum_{I} \lambda_I H D_{I2}},$$

where $L_{I2}$ is the Laplacian matrix, $L_{I2} = D_{I2} - U_{I2}$. $U_{I2}$ is a symmetric matrix, namely, the weight matrix constructed in graph regularization. $D_{I2}$ is a diagonal matrix, and its diagonal elements equal the corresponding row sums (equivalently, the column sums) of $U_{I2}$; i.e., $D_{I2,ii} = \sum_{j=1}^{n} U_{I2,ij}$. $B_I$ is a diagonal matrix with diagonal elements $B_{jj} = 1 / (2\sqrt{\sum_{i=1}^{m} (W_I)_{ij}^2 + \varepsilon})$. Obviously, the objective functions of the two forms of SG-jNMF are both nonconvex, so the multiplicative updates seek locally optimal solutions by minimizing the objective functions. The optimization process is shown as follows.
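The common-sample-space idea can be sketched as follows: several views with different feature dimensions but the same samples are factorized against one shared $H$, whose columns are then the fused sample representations. The data are synthetic, and the final argmax cluster assignment is an illustrative stand-in for whatever clustering step is applied to $H$, not the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(3)
N, K, P = 20, 2, 2
eps = 1e-10

# Two omics "views" of the same N samples (feature counts differ per view).
X = [rng.random((30, N)), rng.random((40, N))]
Ws = [rng.random((x.shape[0], K)) for x in X]
H = rng.random((K, N))                    # shared sample representation

for _ in range(100):
    Ws = [w * (x @ H.T) / (w @ H @ H.T + eps) for x, w in zip(X, Ws)]
    num = sum(w.T @ x for x, w in zip(X, Ws))
    den = sum(w.T @ w for w in Ws) @ H + eps
    H = H * num / den

# Each column of H fuses both views for one sample; a simple cluster
# assignment takes the dominant latent factor per sample.
labels = H.argmax(axis=0)
print(labels.shape)  # (20,)
```

Because every view shares the same $H$, samples end up clustered by evidence pooled across all omics layers rather than by any single layer alone.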

Optimization of SG-jNMF.
Since the optimization processes of the two forms of the SG-jNMF method are very similar, we only provide that of the first method. We use multivariate alternating update rules to solve the optimization problem. Specifically, the following update steps are repeated until the algorithm converges.

Optimization of W.
When $H_I$ is fixed, the optimization of $W$ is performed by minimizing the following objective function:

$$O = \sum_{I=1}^{P} \| X_I - W H_I \|_F^2 + \sum_{I=1}^{P} \lambda_I \mathrm{Tr}(W^T L_{I1} W) + \beta \| W \|_{2,1}.$$

The corresponding Lagrangian function is as follows:

$$\mathcal{L} = O + \mathrm{Tr}(\Phi W^T) + \sum_{I=1}^{P} \mathrm{Tr}(\Psi_I H_I^T),$$

where $\Phi = [\phi_{il}]$ and $\Psi_I = [\psi_{Ia}]$ are the Lagrangian multipliers for the nonnegativity constraints on $W$ and $H_I$, respectively. Next, we take the first partial derivative of this Lagrangian function with respect to $W$:

$$\frac{\partial \mathcal{L}}{\partial W} = -2 \sum_{I} X_I H_I^T + 2 W \sum_{I} H_I H_I^T + 2 \sum_{I} \lambda_I L_{I1} W + 2 \beta Q W + \Phi.$$

According to the KKT conditions [28], $\phi_{il} W_{il} = 0$, and the following updating rule can be obtained:

$$W \leftarrow W \odot \frac{\sum_{I} X_I H_I^T + \sum_{I} \lambda_I U_{I1} W}{W \sum_{I} H_I H_I^T + \sum_{I} \lambda_I D_{I1} W + \beta Q W}.$$

Similarly, when $W$ is fixed, the first partial derivative of the Lagrangian function with respect to $H_I$ is

$$\frac{\partial \mathcal{L}}{\partial H_I} = -2 W^T X_I + 2 W^T W H_I + \Psi_I,$$

and $H_I$ runs to convergence according to the following formula:

$$H_I \leftarrow H_I \odot \frac{W^T X_I}{W^T W H_I}.$$

Convergence and Running Time.
In this paper, we also demonstrate the convergence of the method through experiments. Taking the pancreatic adenocarcinoma (PAAD) dataset as an example, the convergence of the five methods is shown in Figure 2. The error function used in this article is defined as

$$E = \sum_{I=1}^{P} \| X_I - W H_I \|_F^2.$$

Compared with the other four methods, SG-jNMF converges to the smallest error value at the fastest speed.
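A convergence check of this kind is straightforward to reproduce: record the reconstruction error after every multiplicative update and confirm the sequence is non-increasing. The sketch below does this for a single plain NMF factorization on synthetic data (sizes and the small `eps` are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.random((40, 25))
W, H = rng.random((40, 5)), rng.random((5, 25))
eps = 1e-10

errors = []
for _ in range(60):
    H = H * (W.T @ X) / (W.T @ W @ H + eps)
    W = W * (X @ H.T) / (W @ H @ H.T + eps)
    errors.append(np.linalg.norm(X - W @ H, 'fro'))

# Multiplicative updates make the error (numerically) non-increasing.
print(all(a >= b - 1e-6 for a, b in zip(errors, errors[1:])))
```

Plotting `errors` against the iteration index yields exactly the kind of convergence curve shown in Figure 2.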
Besides, we also tested the running times of the above methods on the PAAD dataset. The mean running times of the five methods over 10 runs on a PC are shown in Table 1. As seen in Table 1, iGMFNA has the shortest running time, followed by SG-jNMF, whose speed benefits from the introduction of sparse constraints. The running times of the iNMF, iGMFNA, jNMF, and SG-jNMF methods are all satisfactory.

Computational Complexity Analysis.
In this part, we discuss the extra computational complexity of SG-jNMF compared with jNMF, using big-O notation. On the basis of updating rules (3) and (4), we can easily count the arithmetic operations of each iteration in jNMF: the cost of each iteration is O(PMNK). It should be noted that U_I is a sparse matrix in SG-jNMF. In addition to the multiplicative updates, constructing a K-nearest-neighbor graph requires O(N^2 M) operations [28]. Assuming that the updates stop after t iterations, the overall cost of jNMF is O(tPMNK), and the overall cost of SG-jNMF is O(N^2 M + tPMNK).

Results and Discussion
3.1. Data Processing. The TCGA project includes a large amount of gene expression profile data, DNA methylation data, copy number variation data, protein expression data, drug sensitivity data, and so on. In-depth study of these data can help us understand the mechanisms of cancer occurrence and development and provide technical support for the prevention, diagnosis, and treatment of cancer. In this article, four cancer datasets, all downloaded from TCGA (https://tcgadata.nci.nih.gov/tcga/), namely, PAAD, esophageal carcinoma (ESCA), cholangiocarcinoma (CHOL), and colon adenocarcinoma (COAD), are used in the experiments. Details are listed in Table 2. To avoid matrix dimension mismatches during algorithm execution, the number of genes in the four datasets is aligned to 19,876. First, RPCA is used to reduce the effects of noise and redundant information [29]. Second, the same number of samples and characteristics is retained for the multiomics data of the same kind of cancer. Then, the matrices are normalized according to the standard deviation of the data such that each element of the matrix lies between 0 and 1.
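As a simple illustration of the final scaling step, the sketch below maps each row (gene) of a matrix linearly into [0, 1] via min-max normalization. This is one common choice; the paper's exact standard-deviation-based scheme may differ in detail:

```python
import numpy as np

def minmax_normalize(X, eps=1e-12):
    """Scale each row (gene) of X linearly into [0, 1]."""
    lo = X.min(axis=1, keepdims=True)
    hi = X.max(axis=1, keepdims=True)
    return (X - lo) / (hi - lo + eps)

X = np.array([[1.0, 3.0, 5.0],
              [10.0, 10.0, 40.0]])
Xn = minmax_normalize(X)
print(Xn.min(), Xn.max())
```

Bounding every element in [0, 1] keeps genes with large raw dynamic ranges from dominating the factorization objective.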

Clustering.
When the SG-jNMF2 method projects multiomics data into a common sample space, that space contains all the sample information provided by the input multiomics data. To assess the clustering performance of this method, SG-jNMF2 is used to cluster the tumor samples in the CHOL, PAAD, COAD, and ESCA datasets. Four comparison methods (iNMF, iONMF, iGMFNA, and jNMF) perform the same experiments on the same datasets.

Selection of Parameters.
For SG-jNMF2, clustering performance is affected by the regularization parameters. In this experiment, we empirically set the same value of λ_I for the different omics data of the same cancer [30]. Therefore, three parameters, λ, β, and K, need to be tuned: λ is the graph regularization parameter, β controls the sparsity of the factorization, and K is the number of nearest neighbors used to construct the undirected graph on the manifold. From Figure 3, when K is set to 3, the accuracy on all four datasets reaches its maximum. As seen from Figure 4, λ should be set to 10^3 on PAAD. When λ is equal to 0.1, the accuracy on COAD achieves its maximum. When λ is equal to 10^-3, 10^-5, or 1, the accuracy on CHOL achieves its maximum. When λ is equal to 10^4, the accuracy on ESCA achieves its maximum. From Figure 5, when β is set between 10^-5 and 10^1 for PAAD, the accuracy reaches its maximum. For ESCA and COAD, β should be set between 10^4 and 10^5. For CHOL, the value of β does not matter much.
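Parameter selection of this kind is usually organized as a grid search over the three hyperparameters. The skeleton below shows the structure; `run_and_score` is a hypothetical placeholder that would fit SG-jNMF2 with the given parameters and return clustering accuracy:

```python
import itertools

# Hypothetical grid over the three SG-jNMF2 hyperparameters described above.
lambdas = [1e-5, 1e-3, 1e-1, 1, 1e3, 1e4]
betas = [1e-5, 1e-1, 1e1, 1e4, 1e5]
neighbours = [3, 5, 7]

def run_and_score(lam, beta, k):
    # Placeholder: fit SG-jNMF2 with these parameters, return accuracy.
    return 0.0

best = max(itertools.product(lambdas, betas, neighbours),
           key=lambda p: run_and_score(*p))
print(len(lambdas) * len(betas) * len(neighbours))  # 90 configurations
```

Varying one parameter at a time while holding the others fixed, as in Figures 3-5, is a cheaper alternative to evaluating the full 90-point grid.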

Evaluation Indicators.
Several indicators are used to evaluate the clustering performance of SG-jNMF2: accuracy, recall, precision, and F1-score. Accuracy is defined as

$$\mathrm{ACC} = \frac{1}{N} \sum_{j=1}^{N} \delta(s_j, \mathrm{map}(r_j)),$$

where $N$ is the total number of samples in the dataset and $\delta(x, y)$ is the indicator function: when $x$ is equal to $y$, the value of the function is 1; otherwise, it is 0. $\mathrm{map}(r_j)$ maps the clustering label $r_j$ to the real label $s_j$. The other three indicators used to evaluate clustering performance are defined as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$

where TP means the number of true positives, FP is the number of false positives, and FN denotes the number of false negatives.
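The ACC definition above needs the map(·) step because cluster identifiers are arbitrary. A small sketch, brute-forcing the label mapping over permutations (fine for a handful of clusters; larger problems would use an optimal assignment solver):

```python
from itertools import permutations

def clustering_accuracy(true, pred):
    """ACC = (1/N) * sum_j delta(s_j, map(r_j)), maximized over label maps."""
    labels = sorted(set(pred))
    best = 0
    for perm in permutations(sorted(set(true))):
        mapping = dict(zip(labels, perm))
        best = max(best, sum(mapping[p] == t for p, t in zip(pred, true)))
    return best / len(true)

true = [0, 0, 0, 1, 1, 1]
pred = [1, 1, 0, 0, 0, 1]   # cluster ids are arbitrary names
print(clustering_accuracy(true, pred))
```

Here the identity mapping matches only 2 of 6 samples, but swapping the labels matches 4, so the accuracy is 4/6 rather than 2/6.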

Results.
In this experiment, each algorithm was run fifty times to reduce the impact of random initialization on the clustering results. We compared the accuracy, recall, precision, and F1-score of the four comparison methods with those of SG-jNMF2. The means and variances of the results are shown in Table 3. As seen in Table 3, SG-jNMF2 achieves the highest values on the four indicators mentioned above, except the recall value on the ESCA dataset. The contributions of the sparse and graph regularization constraints are listed in Table 4. Performance improvements are measured by Δ_ind = (Ind_i − Ind_j)/Ind_j, where Ind_i is the indicator of SG-jNMF and Ind_j is that of the comparison method. In particular, the sparse constraints improve accuracy by 49.70%, and the sparse and graph regularization constraints together improve accuracy by 78.87% on the PAAD dataset. Recall and F1-score achieve more than 50% improvement on the CHOL dataset. When the sparse constraints are introduced, only the recall on ESCA is reduced, by 0.53%; the results on the other datasets improve to varying degrees. In summary, the performance of integrative NMF in analyzing multiomics data greatly improves when sparse constraints and graph regularization constraints are introduced.
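The improvement measure Δ_ind is just a relative difference; a one-line helper makes the convention explicit:

```python
def relative_improvement(ind_i, ind_j):
    """Delta_ind = (Ind_i - Ind_j) / Ind_j, as a fraction of the baseline."""
    return (ind_i - ind_j) / ind_j

# E.g. an accuracy that rises from 0.50 to 0.75 is a 50% improvement.
print(f"{relative_improvement(0.75, 0.50):.0%}")  # 50%
```

Note that the denominator is the comparison method's score, so the same absolute gain counts for more against a weaker baseline.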

Identifying Co-DEGs.
First, three matrices (the DM, GE, and CNV data of PAAD) are input into the SG-jNMF1 model and projected into a common feature space. Second, we sum the common feature matrix over its rows. Finally, we sort the elements of the resulting vector in descending order, and the top 100 genes are selected as Co-DEGs. These 100 genes are compared with the pancreatic cancer genes exported from GeneCards (http://www.genecards.org). Co-DEGs with relevance scores above 4 are listed in Table 5. CDKN2A is frequently mutated or deleted in many tumors and plays an important role as a tumor suppressor gene. Studies have shown that mutation of CDKN2A is closely related to the development of familial pancreatic cancer [31]. Mutation and overexpression of CCND1, which can alter the progression of the cell cycle, are frequently seen in many tumors. Wang et al. identified pancreatitis-associated genes and found that CCND1 is involved in the pancreatic cancer pathway [32]. Transcriptome sequencing research shows that PTF1A maintains the expression of genes across cellular processes; deletion of PTF1A leads to imbalance, cell damage, and acinar metaplasia, which is directly related to the development of pancreatic cancer [33]. Scientists have also explored the effects of GRP on human intestinal and pancreatic peptides. Therefore, SG-jNMF1 can effectively integrate the information in multiomics data to identify Co-DEGs closely related to the disease.
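The three-step selection procedure above (row sums, descending sort, top 100) is a few lines of NumPy. The gene identifiers and matrix below are synthetic placeholders standing in for the SG-jNMF1 output:

```python
import numpy as np

rng = np.random.default_rng(5)
M, K = 500, 6
genes = [f"gene_{i}" for i in range(M)]    # hypothetical gene identifiers
W = rng.random((M, K))                     # common feature matrix (SG-jNMF1)

scores = W.sum(axis=1)                     # sum the feature matrix in rows
top_idx = np.argsort(scores)[::-1][:100]   # sort descending, keep top 100
co_degs = [genes[i] for i in top_idx]
print(len(co_degs))  # 100
```

The ranking works because a gene whose row of W carries large weights loads heavily on the shared factors fitted across all three omics layers.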
We also use SG-jNMF1 to integrate three gene expression datasets, from ESCA, CHOL, and COAD, to identify Co-DEGs associated with all three diseases. A subset of the Co-DEGs and their relevance scores with ESCA, CHOL, and COAD are shown in Table 6. The relevance score of CHEK2 with ESCA is as high as 77.66; allelic variation in CHEK2 has a strong relationship with the risk of esophageal cancer [34]. The relevance score of CHEK2 with COAD is 29.65; germline variation in CHEK2 is also closely related to the risk of colon cancer [35]. Frequent mutations in BRPA have been widely reported in human malignancies, including esophageal cancer, cholangiocarcinoma, and colon cancer [36][37][38]. This provides a computational method for the study of Co-DEGs in multiple diseases.

Conclusions
In this paper, we propose an integrative matrix factorization method (SG-jNMF) for analyzing heterogeneous multiomics data. The novel method jointly projects the multiomics data matrices into a common low-dimensional space. Two forms of SG-jNMF enable multiomics data to be analyzed from both the sample and feature perspectives. This integrative analysis method accounts for the local associations in the data and decreases the interference of noise and redundant information in heterogeneous multiomics data. Experimental results show that the new method is superior to existing methods in analyzing heterogeneous multiomics data. Another significant advantage of SG-jNMF is that it can flexibly handle multiple input data of various types: the inputs can be different types of data (GE, DM, CNV, etc.) for the same disease or the same type of data for different diseases. We can use this method to identify Co-DEGs associated with a particular disease and to detect common Co-DEGs associated with several diseases.
This provides an efficient computational method for biological and medical research. Next, we will use the correlations between Co-DEGs to build a gene coexpression network and further study the functions of gene modules and their related pathways.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.