Robust Nonnegative Matrix Factorization via Joint Graph Laplacian and Discriminative Information for Identifying Differentially Expressed Genes

Differential expression plays an important role in cancer diagnosis and classification. In recent years, many methods have been used to identify differentially expressed genes. However, the recognition rate and reliability of gene selection still need to be improved. In this paper, a novel constrained method named robust nonnegative matrix factorization via joint graph Laplacian and discriminative information (GLD-RNMF) is proposed for identifying differentially expressed genes, in which manifold learning and the discriminative label information are incorporated into the traditional nonnegative matrix factorization model to train the objective matrix. Specifically, L2,1-norm minimization is enforced on both the error function and the regularization term which is robust to outliers and noise in gene data. Furthermore, the multiplicative update rules and the details of convergence proof are shown for the new model. The experimental results on two publicly available cancer datasets demonstrate that GLD-RNMF is an effective method for identifying differentially expressed genes.


Introduction
Cancer is one of the most serious diseases that endanger the health of human being.Millions of people die of cancer every year.With the development of gene sequencing technology and other gene detection technologies, huge gene data have been generated [1,2].Therefore, it is important and challenging for scientists to find pathogenic genes from a large number of gene expression data.Microarray datasets on each chip usually contain many gene expression data, and the number of samples is far less than that of genes, which makes the identification of differentially expressed genes difficult [3].In addition, irrelevant or noisy variables may reduce the accuracy of the results.In recent years, many effective mathematical methods have been applied to identify differentially expressed genes.For example, principal component analysis (PCA) [4,5] and penalized matrix decomposition (PMD) [6] have been used to analyze gene expression data.Liu et al. used robust principal component analysis (RPCA) to discover differentially expressed genes [7].Zheng et al. employed nonnegative matrix factorization (NMF) on the selection of tumor genes [8].Cai et al. proposed an algorithm named graph regularized nonnegative matrix factorization (GNMF) for data representation [9].Wang et al. used robust graph regularized nonnegative matrix factorization (RGNMF) for identifying differentially expressed genes [10].A CIPMD (Class-Information-Based Penalized Matrix Decomposition) algorithm was proposed to identify the differentially expressed genes on RNA-Seq data, which introduced the class information via a total scatter matrix [11].The Consensus Clustering methodology was proposed for microarray data analysis by Giancarlo and Utro [12].
However, two characteristics of gene expression data pose a serious challenge to the existing methods.Firstly, a large number of researchers hold that gene expression data probably reside in a low dimensional manifold embedded in a high dimensional ambient space.Therefore it is critical to consider the geometrical structure in the original gene expression data.Manifold learning is clearly an effective method to preserve the data geometric structure embedded in the original gene expression data [13,14].Cai et al. proposed GNMF [9], in which the geometrical structure of data 2 Complexity was constructed by an affinity graph.Another variant of NMF called manifold regularized discriminative nonnegative matrix factorization (MD-NMF) was also introduced [15].MD-NMF considered both the local geometry of data and the discriminative information of different classes simultaneously.Long et al. proposed a method called graph regularized discriminative nonnegative matrix factorization (GDNMF) [16], in which both the geometrical structure and discriminative label information were considered in the objective function.Secondly, gene expression data often contain a lot of outliers and noise.However, existing methods cannot effectively eliminate outliers and noise.For example, least squares methods are sensitive to outliers and noise.In recent years, many researchers have been devoted to improving the robustness to outliers and noise.Zheng et al. proposed an algorithm named generalized hierarchical fuzzy -means [17], which is robust to noise and outliers.Wang et al. used  2,1 -norm to reduce the effect of outliers and noise [10].
A novel algorithm, which we call robust nonnegative matrix factorization via joint graph Laplacian and discriminative information (GLD-RNMF), is proposed to overcome the aforementioned problems together.The proposed algorithm preserves the geometric structure of data space by constructing an affinity graph and improves the discriminative ability by the supervised label information.To do so, a new matrix decomposition objective function by integrating the geometric structure and label information is constructed.In addition, we employ  2,1 -norm instead of  2 -norm on the error function and the regularization term to reduce the influence of outliers and noise.For completeness, we present that the convergence proof of our iterative scheme is also shown in the Appendix.Experimental results indicate that the GLD-RNMF algorithm has better results than other existing algorithms for identifying differentially expressed genes.
The remainder of the paper is arranged as follows.In Section 2, we briefly introduce some relevant mathematical foundation and propose the GLD-RNMF algorithm in detail.In Section 3, the results of differentially expressed gene selection using our GLD-RNMF method and the other four methods (GNMF, NMFSC, RGNMF, and GDNMF) are shown for comparison.Finally, we conclude this paper in Section 4.

Mathematical Definition of 𝐿
where y  is the th row of Y and Y is  ×  matrix. 2,1norm is interpreted as follows.Firstly, we compute  2 -norm of the vector y  and then compute  1 -norm of vector p(Y) = (‖Y 1 ‖ 2 , ‖Y 2 ‖ 2 , . . ., ‖Y  ‖ 2 ).The value of the elements of vector p(Y) represents the importance of each dimension. 2,1norm enables the vector p(Y) sparse to achieve the purpose of dimension reduction.

Manifold
Learning.The purpose of this work is to get the best approximation of the original data.We also hope that the new representation can respect the intrinsic Riemannian structure.Recently, many researchers hold that high dimensional data often reside on a much lower dimensional manifold.The "manifold assumption" could be that data points nearby in the intrinsic geometry structure are also close under the new basis.Therefore, they usually have similar characteristics and can be categorized into the same class.
In this paper, we employ manifold learning to achieve the aforementioned goal.For a graph with  vertices, each vertex corresponds to a data point.For each data point, we can find its  nearest neighbors and connect it with the neighbors.There are many ways to define the weight matrix W on the graph, for example, 0-1 weighting, heat kernel weighting, and dotproduct weighting.Considering that 0-1 weighting is the simplest and easy to compute, we choose 0-1 weighting as the measure in this paper.0-1 Weight.W  = 1, if and only if two nodes  and  are connected by an edge.That is, where N  (x  ) consists of  nearest neighbors of x  and the neighbors have the same label with x  .Therefore, the smoothness of the dimensional representation can be measured as follows: where tr(⋅) represents the trace of a matrix.B is a diagonal matrix and B  is the row sum (or column, because W is symmetric, W ∈  × ) of W; that is, B  = ∑  =1 W  .L = B−W is graph Laplacian matrix and L ∈  × .We measure the distance of two points in the low dimensional space by the Euclidean distance (s  , s  ) = ‖s  − s  ‖ 2 .

Nonnegative Matrix Factorization (NMF).
We review the standard NMF in this section.Although the algorithm has been widely used in many aspects, there are still many shortcomings.
Given  nonnegative samples [x 1 , x 2 , . . ., x  ] in   , arranged in columns of a matrix X ∈  × , in this paper, each row of X represents the transcriptional response of the  genes in one sample and each column of X represents the expression level of a gene across all samples.Letting matrices X ∈  × , V ∈  × , and H ∈  × , NMF decomposes X into the product of V and H; that is, X ≈ VH.
To ensure an approximate factorization X ≈ VH, two update rules are introduced [19].One of the objective functions is constructed by minimizing the square of the Complexity 3 Euclidean distance between X and VH.The optimization problem is described as follows: min where ‖ ⋅ ‖  denotes the matrix Frobenius norm.The corresponding optimization rules are as follows: ( The convergence of the above optimization rules has been proven [19].

Graph Regularized Discriminative Nonnegative Matrix Factorization (GDNMF)
. Supervised label information is added to the objective function of GNMF [16].The definition and iterative rules of GDNMF are presented below.
Class indicator matrix S ∈  × is defined as follows: where y  ∈ {1, 2, . . ., } is the class label of x  and  is the total number of classes in X.
The objective function of GDNMF is formulated as follows: min The corresponding optimization rules are as follows: where A ∈  × is initialized to a random nonnegative matrix in the algorithm. and  are nonnegative regularization parameters, respectively.Essentially, GDNMF incorporates the graph Laplacian and supervised label information into the objective function of NMF, which ensures the algorithm to keep consistent with the intuitive geometric structure of the data and improves the discriminative power of different classes.

Robust Nonnegative Matrix Factorization via Joint Graph
Laplacian and Discriminative Information (GLD-RNMF) 2.5.1.The Objective Function.For the purpose of dimension reduction, NMF represents original data X ∈  × by a product of a nonnegative matrix V ∈  × and coefficient matrix H ∈  × .The approximation error is calculated according to the squared residuals; that is, Due to the squared term in the objective function, smaller outliers can lead to larger errors.In this paper, we enforce  2,1 -norm constraint on the objective function to reduce the impact of outliers and noise.By employing  2,1 -norm on GDNMF model, we can formulate the objective function of GLD-RNMF as follows: min This objective function can solve high dimensional, negative, noisy and sparse data simultaneously, keep consistent with the intuitive geometric structure of data, and improve the discriminative power of different classes.

The Multiplication Update Rules of GLD-RNMF.
Although the objective function is not convex jointly about (V, H, A), it is convex in regard to one of variables in (V, H, A) when the others are fixed.The objective function can be expanded as follows: where Q and G both are diagonal matrices and the diagonal elements are as follows: in which  is an infinitesimal positive number.
In order to solve the optimization problem in (9), we introduce the Lagrange multipliers Φ, Ψ, and Ω for V, H, and A, respectively.Firstly, we formulate the Lagrange function of GLD-RNMF as follows: Taking the partial derivatives of   with respect to V, A, and H and setting them to zero and in view of tr(XY) = tr(YX) and tr(X  ) = tr(X), we get According to the KKT (Karush-Kuhn-Tucker) conditions [20], that is, Φ  V  = 0, Ω  A  = 0, and Ψ  H  = 0, we can obtain the following equations: Then we can get the multivariate updating rules as follows: The details of our method are described in Algorithm 1.The iterative procedure is performed until the algorithm converges.
Considering the three update rules above, we ensure the convergence of the algorithm by the following theorem.16), (17), and (18).
The detailed proof of the theorem is shown in the Appendix.

Results and Discussion
In order to verify the effectiveness of GLD-RNMF algorithm for identifying differentially expressed genes, we perform experiments on real gene expression datasets to compare our algorithm with the other four feature extraction algorithms: (a) GNMF algorithm (Cai et al. [9]); (b) NMFSC algorithm (Hoyer [21]); (c) RGNMF algorithm (Wang et al. [10]); (d) GDNMF algorithm (Long et al. [16]).We conduct these experiments on two publicly available cancer datasets: pancreatic cancer dataset (PAAD) and cholangiocarcinoma dataset (CHOL).

Identifying Differentially Expressed Genes by GLD-RNMF.
In this section, we use GLD-RNMF to identify differentially expressed genes.The matrix X with size  ×  is the original gene expression data.Each row of X indicates the transcriptional response of  genes in a sample.Each column of X indicates the expression level of a gene in all samples.Therefore, X can be written as follows: where V is the basis matrix with size  ×  and H is the coefficient matrix with size × and  ≪ min(, ).Since the matrix V contains all of the genes, the differentially expressed genes can be identified from the matrix V [10].By GLD-RNMF, the evaluating vector V is obtained in which elements are sorted in descending order: Generally, the larger the entry in V is, the more differential this gene is.Therefore, the differentially expressed genes can be obtained by the first num (num ≤ ) largest elements in V.
The objective of the experiment is to identify the differentially expressed genes by GLD-RNMF algorithm.The identifying process is described below.
(1) Obtain the nonnegative matrix X according to the genomic dataset.(2) Construct the label matrix S and the diagonal matrixes G and Q. (3) Gain the Laplacian matrix L and basis matrix V via GLD-RNMF algorithm.(4) Identify differentially expressed genes through the vector V. (5) Check differentially expressed genes by gene ontology tool.

Parameters Selection.
We assign parameters in our GLD-RNMF algorithm following the same way proposed by Long et al. [16].Distinguishingly, there are two parameters, that is,  and , in GLD-RNMF method.Fortunately, if  and  are set in a reasonable range they have little effect on the performance of the algorithm [15,16].
Algorithm 1: GLD-RNMF.In our experiments, we set  = 0.9 and  = 0.5 in the GLD-RNMF algorithm.Another important parameter in our GLD-RNMF algorithm is  which is used to construct a nearest graph.Empirically, we set  = 5 and adopt the mode as the heat kernel in LPP [22].Besides, we set the reduced dimension to 5 for all the methods.All the parameters in the other methods keep in line with those described in their paper [10,16,21,23].

Gene Ontology Analysis.
The gene ontology (GO) tool can interpret the genes that are input and discover the functions that these genes may have in common.As a webbased tool [24], GO Enrichment Analysis can find important GO items from a large number of genes and provide important information for the biological interpretation of highthroughput experiments.Another online tool that we use is ToppFun.It usually is used to interpret the differentially expressed genes.
To be fair, we extract 100 genes from the gene expression data by GNMF, NMFSC, RGNMF, GDNMF, and GLD-RNMF methods.Threshold parameters of ToppFun are set as follows: the maximum  value is set to 0.01 and the minimum number of gene products is set to 2.

Pancreatic Cancer Dataset.
Pancreatic cancer is a tumor with high malignancy, which is difficult to diagnose and treat.Early diagnosis of pancreatic cancer is not difficult and the mortality rate is high.The cause of pancreatic cancer is still not clear until now.In the experiment, there are 20502 genes in 180 samples contained in the dataset.
The top 10 GO items extracted and the  values of the five methods are listed in Table 1.In this table, "ID" and "Name" represent items and their names associated with the GO in the whole genome and the lowest  value of the five methods has been marked in bold font.As we find from Table 1, the  values generated by GLD-RNMF are much smaller than those by the other four methods for the PAAD.Therefore, GLD-RNMF method is more superior than the other four methods for the PAAD.The name of "GO:0030198" is extracellular matrix organization.It contains DPT (Dermatopontin), POSTN (Periostin), sec24d (Sec24-related protein d), and other genes which are related to pancreatic cancer [25][26][27].For example, POSTN can create a tumor-supportive microenvironment in the pancreas [28].Genes and gene products associated with extracellular structure organization (GO:0043062) can be found by GO tool, in which, DPT, POSTN, sec24d, uxs1 (UDP-Glucuronate Decarboxylase 1), fkrp (Fukutin Related Protein), and other genes have been illustrated to be associated with pancreatic cancer [29,30].For example, Sec24d is ubiquitously expressed but exhibits predominant expression in heart, placenta, liver, and pancreas.The other GO items can also be proven to be related to pancreatic cancer by some relevant literature material.
Clearly, GLD-RNMF method is an effective method for identifying differentially expressed genes.
Comparing 100 genes extracted by GLD-RNMF with what we obtain from Gene Cards (http://www.genecards.org/) about pancreatic cancer, 82 of the 100 genes are associated with pancreatic cancer.Many genes, which were previously thought to be unrelated to clinical outcomes, are identified.We present the top 10 of 82 genes with higher relevance scores in Table 2, including their gene ID, names, function, and related diseases.Among the identified differentially expressed genes, MMP2, MMP7, IGFBP3, INS, and the other genes have been demonstrated to be related to pancreatic cancer [31][32][33][34].For example, the effect of MMP-2 and its activators MT1-MMP, MT2-MMP, and MT3-MMP in pancreatic tumor cell invasion and the development of the desmoplastic reaction characteristic of pancreatic cancer tissues have been discussed [35].Akihisa Fukuda et al. demonstrated that serum MMP7 level in human pancreatic ductal adenocarcinoma patients is correlated with metastatic disease and survival.Conditioned medium from Capan-1 pancreatic cancer cells which contains abundant IGFBP-3 has been mentioned [36].Other genes identified by GLD-RNMF have been illustrated to be related to pancreatic cancer by some relevant literature materials as well.
On the other hand, we use Kyoto Encyclopedia of Genes and Genomes (KEGG) online analysis tool to analyze the differentially expressed genes identified by GLD-RNMF.In this experiment, putting the identified 100 genes into KEGG, we can obtain the corresponding disease pathway.Figure 1 is the pathway of pancreatic cancer.The genes that have been found by GLD-RNMF are marked with red.The chromosomal instability pathway records the disease progression from normal duct to pancreatic cancer.Infiltrating ductal adenocarcinoma is the most common malignancy of the pancreas.Normal duct epithelium progresses to the stage of infiltrating cancer through a series of histologically defined precursors.These differentially expressed genes contain two oncogenes: K-Ras and HER2/neu.p16, p53, BRCA2, and Smad4 are tumor suppressors.Slebos et al. assessed that K-ras oncogene mutations and p53 protein accumulation are associated with known or postulated risk factors for pancreatic cancer [37].Gu et al. found the expression of Smad4 and p16 is significantly lower in pancreatic cancer tissue compared with normal tissue.The lower expression of the proteins may impact the development of pancreatic cancer [38].From Figure 1, we can see that the pancreatic cancer pathway map contains six other pathways: PI3K-Akt signaling pathway, ErbB signaling pathway, MAPK signaling pathway, VEGF signaling, Jak-STAT signaling pathway, p53 signaling pathway, and TGF- signaling pathway.
PI3K-Akt signaling pathway is presented in Figure 2. From Figure 2, we can find four proteinases from the differentially expressed genes identified by GLD-RNMF.Therefore, our algorithm achieves better results.

Cholangiocarcinoma Dataset.
Cholangiocarcinoma is diagnosed in 12,000 patients in the US each year, but only 10 percent are discovered early enough to allow for successful surgical treatment.In our experiments, we apply the five methods on CHOL which contains 20502 genes on 45 samples.The 8 GO items closely related to cholangiocarcinoma   and the  values of the five methods are listed in Table 3.In this table, "ID" and "Name" represent items and their names associated with the GO in the whole genome.The lowest  value of the five methods has been marked in bold font.We can see from the table that GLD-RNMF is much better than the other four methods.Genes and gene products associated with blood microparticle (GO:0072562) can be found by the GO tool, in which APOA4 (Apolipoprotein A4), kng1 (Kininogen 1), and other genes have been illustrated to be associated with cholangiocarcinoma [39][40][41].The name of "GO:0060205" is cytoplasmic vesicle lumen.It contains ada (Adenosine Deaminase), DBH (Dopamine Beta-Hydroxylase), and other genes which are related to cholangiocarcinoma [42,43].The top 8 genes identified by GLD-RNMF are listed in   and other genes are successfully identified which represent potential biomarkers for cholangiocarcinoma and potential targets for clarifying the molecular mechanisms associated with cholangiocarcinoma.For example, HP was proposed as pronucleating proteins because they were highly expressed in the fast nucleating bile of patients with cholesterol stones [44].Waghray et al. predicted survival in patients with hilar cholangiocarcinoma by serum albumin [45].C3 and HP were identified as more abundant in cholangiocarcinoma [46].The relative documents can illustrate that the other genes identified by GLD-RNMF are associated with cholangiocarcinoma.

Conclusions
By introducing   rules in (18).We use an auxiliary function similar to what is used in the Expectation-Maximization algorithm [47].In the demonstration, we present the definition of auxiliary function [16].Definition G(ℎ, ℎ  ) is an auxiliary function of (ℎ) if the following conditions are satisfied.
(ℎ, ℎ  ) ≥  (ℎ) ,  (ℎ, ℎ) =  (ℎ) . (A.1) The auxiliary function is vital due to the following lemmas.It is enough to prove that each   is nonincreasing under the update rules because our update is essentially wised.Therefore, we present the following lemma.(A.7)

Figure 1 :
Figure 1: Pathway of pancreatic cancer, where pink background represents disease genes, light blue represents drug target genes, and light green represents human genes.Genes found by GLD-RNMF are marked with red.

Figure 2 :
Figure 2: P13K-AKT signaling pathway, where pink background represents disease genes, light blue represents drug target genes, and light green represents human genes.Genes found by GLD-RNMF are marked with red.

Table 1 :
Comparison of  values of different methods on PAAD.

Table 3 :
Comparison of  values of different methods on CHOL.