Low-Rank and Sparse Matrix Decomposition for Genetic Interaction Data

Background. Epistatic miniarray profile (EMAP) studies have enabled the mapping of large-scale genetic interaction networks and generated large amounts of data in model organisms. One approach to analyze EMAP data is to identify gene modules with densely interacting genes. In addition, genetic interaction score (S score) reflects the degree of synergizing or mitigating effect of two mutants, which is also informative. Statistical approaches that exploit both modularity and the pairwise interactions may provide more insight into the underlying biology. However, the high missing rate in EMAP data hinders the development of such approaches. To address the above problem, we adopted the matrix decomposition methodology “low-rank and sparse decomposition” (LRSDec) to decompose EMAP data matrix into low-rank part and sparse part. Results. LRSDec has been demonstrated as an effective technique for analyzing EMAP data. We applied a synthetic dataset and an EMAP dataset studying RNA-related processes in Saccharomyces cerevisiae. Global views of the genetic cross talk between different RNA-related protein complexes and processes have been structured, and novel functions of genes have been predicted.


Introduction
Genetic interactions, which represent the degree to which the presence of one mutation modulates the phenotype of a second mutation, could be measured systematically and quantitatively in recent years [1,2]. Genetic interactions can reveal functional relationships between genes and pathways. Furthermore, genetic networks measured via highthroughput technologies could reveal the schematic wiring of biological processes and predict novel functions of genes [3]. Recently, several high-throughput technologies have been developed to identify genetic interactions at genome scale, including Synthetic Genetic Array (SGA) [4], Diploid-Based Synthetic Lethality Analysis on Microarrays (dSLAM) [5], and epistatic miniarray profile (EMAP) [6]. In particular, EMAP systematically construct double deletion strains by crossing query strains with a library of test strains and identify genetic interactions by measuring a growth phenotype. An score was calculated based on statistical methods for each pair of genes, while negative scores represent synthetic sick/lethal and positive scores indicate alleviating interactions [6].
Consequently, for each pair of genes, there are two different measures of relationship in EMAP platform. First, the genetic interaction score ( score) represents the degree of synergizing or mitigating effects of the two mutations in combination. Second, the similarity (typically measured as a correlation) of their genetic interaction profiles represents the congruency of the phenotypes of the two mutations across a wide variety of genetic backgrounds. So there are two important aspects in exploiting EMAP data. On the one hand, cellular functions and processes are carried out in series of interacting events, so genes participating in the same biological process tend to interact with each other. Therefore, algorithms that detect gene modules composed of densely interacting genes are of great interest. Within these modules, genes tend to have similar genetic interaction profiles; thus the submatrix for these genes tends to have a low-rank structure. On the other hand, the cross talks between modules are usually indicated by gene pairs with high scores (so that the genetic interaction is significant). Removing them results in better low-rank structure. Evocatively, these gene pairs are likely shadows over the low-rank matrix and connect different rank areas. These cross talks reveal the relationships of different biological process or protein complexes. Meanwhile, gene pairs exhibiting high absolute value of scores may encode proteins that are physically associated or be enriched in protein-protein interactions [7][8][9]. So the investigation of score is equally important. However, the current methodologies in genetic interaction networks analysis did not efficiently address these two important issues simultaneously.
In order to identify modules and between-module cross talks in genetic interaction networks, we employ the "lowrank and sparse decomposition" (LRSDec) to decompose EMAP data matrix into a low-rank part and a sparse part. We propose that the low-rank structure accounts for gene modules, in which genes have high correlations, and the sparsity matrix captures the significant scores. In particular, entries in sparse matrix found by LRSDec correspond to two sources of biologically meaningful interactions, withinmodule interactions and between-module links. In this paper, we focus our discussion of the sparse matrix on the results of between-module links, while the results of within-module interactions can be found in the Supplementary Material available online at http://dx.doi.org/10.1155/2015/573956 (Supplementary Data 1).
Low-rank and sparse of matrix structures have been profoundly studied in matrix completion and compressed sensing [10,11]. The robust principal component analysis (RPCA) [12] proved that the low-rank and the sparse components of a matrix can be exactly recovered if it has a unique and precise "low-rank + sparse" decomposition. RPCA offers a blind separation of low-rank data and sparse noises, which assumed X = L + S (S is the sparse noise), and exactly decomposes X into L and S without predefined rank(L) and card(S). Another successful matrix decomposition method GoDec studied the approximated "low-rank + sparse" decomposition of a matrix X by estimating the low-rank part L and the sparse part S from X, allowing noise, that is, X = L + S + e, and constrained the rank range of L and the cardinality range of [13]. It has been stated that GoDec has outperformed other algorithms before.
In this paper, we modified the GoDec matrix decomposition method and developed "low-rank and sparse decomposition" (LRSDec) to estimate the low-rank part L and the sparse part S of X. LRSDec minimizes the nuclear norm of L and predefines the cardinality range of S, while considering the additive noise e. Different from GoDec, which directly constrains the rank range of L, LRSDec minimizes its responding convex polytopes, that is, the nuclear norm of L. It has been proven that the nuclear norm outperforms the rank-restricted estimator [14]. Furthermore, if, in presence of missing data, LRSDec could impute the missing entries while decomposing, with no need for data pretreatment, while GoDec could not accomplish decomposition and imputation simultaneously, then we stated the convergence properties of our algorithm and proved that, given the two regularization parameters, the objective value of LRSDec monotonically decreases. By applying both methods to a synthetic dataset, we demonstrated the superiority of LRSDec over GoDec. Finally, we analyzed a genetic interaction dataset (EMAP) using our algorithm and identified many biologically meaningful modules and cross talks between them.

Model
Let X be an × matrix that represents a genetic interaction dataset, where is the number of query genes and is the number of library genes. We propose to decompose X as where L ∈ R × denotes the low-rank part and S ∈ R × denotes the sparse part, and e is the noise. Here, we introduce L ∈ R × to account for modules, in which genes are highly correlated. These modules correspond to protein complexes, pathways, and biological pathways, in which genes tend to share similar genetic interaction profiles [15]. S ∈ R × is introduced to account for significant scores, which are either gene pairs in the same module that have genetic interactions or cross talks among different functional modules.
Based on the assumptions above, we propose to solve the following optimization problem: minimize rank (L) , minimize card (S) where ≥ 0 is a regularization parameter that controls the error tolerance, and card(S) denote the number of nonzero entries in matrix S.
To make the minimization problem tractable, we relax the rank operator on L with the nuclear norm, which has been proven to be an effective convex surrogate of the rank operator [14] minimize ‖L‖ * , minimize card (S) where ‖L‖ * is the nuclear norm of L (‖L‖ * = ∑ =1 , where 1 , . . . , are the singular values of L and is the rank of L). However, missing data is commonly encountered in EMAP data, confounding techniques such as cluster analysis and matrix factorization. Here, we extend our basic model (3) to handle EMAP data with missing values by imputing missing entries in the matrix simultaneously when estimating low-rank matrix L and sparse matrix S. Suppose that we only observe a subset of X, indexed by Ω, and the missing entries are indexed by Ω ⊥ . In order to find a low-rank matrix L and a sparse matrix S based on the observed data, we propose to solve the following optimization problem:

Algorithm
Similar to GoDec, the optimization problem of (3) can be solved by alternatively optimizing the following two subproblems until convergence: In each iteration, we optimize the objective function by alternatively updating L and S. Firstly, the subproblem (5a) can be solved by [14]. For fixed S, the solution of (5a) is Here, ≥ 0 is a regularization parameter controlling the nuclear norm of estimated value L , where UDV is the Singular Value Decomposition (SVD) of W and here + = max( , 0). The notation T (W) refers to softthresholding [14].
Next, the subproblem (5b) in (3) could be updated via entry-wise hard thresholding of X − L for fixed L . Before giving the solution, we define an orthogonal projection operator P. Suppose there is a subset of dataset W, indexed by Ω; then the matrix W can be projected onto the linear space of matrices supported by Ω: And P Ω ⊥ is its complementary projection; that is, Then the solution of (5b) could be given as follows: where P is the orthogonal projection operator as defined above, Θ is the nonzero subset of the first largest entries of |(X − L )|. Then, the matrix (X − L ) can be projected onto the linear space of matrices supported by Θ: So far we have developed the algorithm for solving problem (3). As for problem (4), due to the existence of missing values, we took the optimization on the observed data, Ω. We updated L and S of the following optimization subproblems, respectively: The term ‖P Ω (X − L − S)‖ 2 is the sum of squared errors on the observed entries indexed by Ω.
The subproblem (11a) can be solved by updating L with an arbitrary initialization using [14] The solution of subproblem (11b) is where Θ is the nonzero subset of the first largest entries of |P Ω (X − L )|. Now we have the following algorithm.

Parameter Tuning
We have two parameters that need to be tuned in our models: and . Here, we propose a 10-fold cross validation strategy to select them. The idea is as follows: let Ω be the index of observed entries of X. We randomly partition Ω into 10 equal size subsets and choose training entries Ω 1 and testing entries We may solve problem (15) on a grid of ( , ) values on the training data: Then we evaluate the prediction error (15) on the testing data: The cross validation process is repeated for 10 times. Then we can find the optimal parameter ( * , * ), which minimizes the mean of the prediction error.

Synthetic Data.
We simulated a synthetic data and then applied LRSDec algorithm and GoDec algorithm to it. Specifically, low-rank part, sparse part, and noises are generated as follows.
(i) Low-rank part: the covariance matrix Σ is generated by HH , where H ∈ R × and H , ∼ N(0, 1). Here is the number of hidden modules. The random entries (ii) Sparse part: the non-zero entries in sparse matrix are generated from the tail of Gaussian distribution N(1, 2), whose upper quantile is = 0.01. We randomly selected 70% of them to assign the opposite sign. This is consistent with EMAP datasets, in which negative genetic interactions are much more prevalent than the positive ones.
A low-rank matrix with rank 25 and sparse matrix with cardinality 250 were generated, respectively. Now we have The first step is parameter training, and the result is showed in Figure 1. Minimal prediction error was achieved when = 250 and = 25, which coincides with the rank and cardinality of the synthetic data. This demonstrated the effectiveness of cross validation procedure.
Next, we compared the performance of LRSDec algorithm and GoDec algorithm by comparing their prediction error. The relative error is defined as where W is the original matrix andŴ is an estimate/approximation. As both algorithms are influenced seriously by the parameters, we compared the relative error of the two algorithms by given different parameters. To make the comparison simple, we only changed one parameter with another parameter fixed ( Figure 2). One can see that both algorithms reach the smallest relative if adopting the correct two parameters, and LRSDec outperforms GoDec. In the Supplementary Material, we also compare the performance of both algorithms under different noise setting, and the trends are the same.

Application to EMAP Data from Yeast.
We also applied our method to EMAP data interrogating RNA processing in S. cerevisiae, which consists of 552 genes involved in RNArelated processes [9]. This genetic map contains about 152,000 pairwise genetic interaction measurements with about 29% missing entries in data matrix. We applied our method to this EMAP data, denoted as X, to obtain two matrices, a low-rank matrix L and a sparse matrix S.X = L + S is the new complete data matrix with imputed missing entries. To exploit the quantitative information from EMAP data, we first subjected the entire low-rank matrix L to hierarchical clustering, an approach that groups genes with similar patterns of genetic interactions. It should be noted that using low-rank matrix L in cluster improved the performance of hierarchical clustering in detecting genetic interaction modules [16]. According to the clustering result, we reordered rows and columns of matrixX, so that the protein complexes and biological processes showed in [9] could be found ( Figure S1).
To help identify more modules of cellular functions and processes and reveal the relationships between them, we further analyzed the matrices L and S. Figure 3 is a flowchart of our strategy 1 to detect modules and cross talks between them through low-rank matrix L. In this paper, we define module as a cluster from hierarchical clustering (HC) that passes through GO-enriched filtering (Supplementary Section 3). The details are as follows. Firstly, we clustered the row genes of matrix L using hierarchical clustering (HC). Here, we adopted the average-linkage hierarchical clustering algorithm in which the distance of gene and gene is defined as 1 − |cor( , )|, where |cor( , )| is the absolute value of the correlation of genetic interaction profile of gene and gene . A cutoff needs to be applied for HC to cut the hierarchical clustering tree. We used the Jaccard index (Supplementary Section 4) to determine how well the predicted gene sets correspond to benchmark (theoretical) gene sets [17]. Here, GO functional gene sets are used as benchmark (theoretical) gene sets. The cutoff at which HC achieved the highest Jaccard index is used to cut the hierarchical tree. We calculated the Jaccard index of every "height" cutoff in hierarchical clustering from 0.2 to 0.95 by 0.05 interval. This step resulted in the best Jaccard index with height = 0.7 and 84 clusters. Now we got the clusters of row genes, in which genes act in a consistent manner across the entire column genes. Then we filtered the clustering results by a hypergeometric test that calculates the significance of enrichment of GO items, and the value was set to 0.01. The clusters enriched in GO functional categories are defined as row modules. Secondly, for each row module, we exploited modules of column genes based on this row gene module in matrixX by clustering the column genes of this submatrix ofX. Thirdly, we screened column clusters whose interactions with the row modules are      significantly enriched by Fisher's exact test (Table 1, value = 0.05). Here, we defined the positive genetic interactions as those gene pairs with genetic interaction scores ≥ 2.0 and negative as ≤ −2.5 [9]. The reduced gene sets of column genes were defined as column modules (corresponding to the row module). Next we identified the enriched GO functional categories of these column modules by mapping them to GO items (hypergeometric test). Finally, repeating these steps for all row modules, we identified the modules and intermodule genetic cross talk of the whole genetic interaction network (Figures 4-6), where red and green represent a statistically significant enrichment of positive and negative interactions. The cross talks constructed in these figures are the scores among genes in the original matrix. In the following, we will discuss many of the interesting connections that have been reported previously. The low-rank matrix found more functional modules than the original matrix ( Table 2). In Table 2, we cut the dendrogram at different heights and compared the clusters obtained from the low-rank matrix and that of the original matrix. Jaccard index [17] is used to determine how well the predicted clusters recaptured the benchmark gene sets ((a) GO slim and (b) GO BP FAT). The definition of Jaccard index can be found in the Supplementary Material. In the ideal situation where predicted clusters perfectly match the benchmark gene sets, the Jaccard index is 1. The larger the Jaccard index, the better the predictions. The clusters obtained from clustering of the low-rank matrix are more enriched with GO functional categories at varying cutoffs ( Table 2). Figure 4 gives an overview of the relationships among biological processes when GO slim (downloaded from SGD [18], a broad overview of all of the top GO categories) is used as GO items. We found that several sets of genes that have been known to function in the same biochemical processes contain predominantly positive or negative interactions, which was also observed in [9]. For example, genes classed as involved in RNA splicing and transcription are significantly enriched in negative genetic interactions (Figure 4, green nodes). In contrast, the module involved in protein folding has strong positive interactions (red node). In addition, Figure 4 also suggested that not all modules have consistent pattern of interactions (yellow node), which is reasonable in biological processes. Finally, several connections have been previously discussed. For example, there is good evidence for functional interactions between splicing and transcription in [19] and functional interactions between splicing and translation in [20]. Furthermore, [21] reported the cooperative relationship between protein folding and chromosome organization.
Actually, if we classify the GO functional categories to more fine items (GO BP FAT, downloaded from DAVID http://david.abcc.ncifcrf.gov/), we can get a more comprehensive network ( Figure 5). Many of the interaction results in Figure 5 have been reported before. For instance, genes involved in tubulin complex assembly process have negative genetic interactions with genes involved in tRNA wobble uridine modification process, supported by SGD, while the negative genetic interactions between RNA splicing process and tRNA metabolic process could also be found. Actually, the genetic interaction between RNA splicing and chromatin modification has been studied in [22]. And the balance of the interactions between the processing of ribonucleoprotein assembly of intronic noncoding RNAs and the splicing process regulating the levels of ncRNA and host mRNA can be found in [23]. Tubulin functionally relating to roles of the elongator complex in tRNA wobble uridine modification is supported by [24]. Moreover, epistasis and chromatin immunoprecipitation experiments indicating that the loss of Rrp6 (regulation of transcription) function is paralleled by the recruitment of Hda1 (histone deacetylase) have been reported by [25]. Finally, cotranscriptional recruitment of the mRNA export factor Yra1 realized by direct interaction with the 3 end processing factor Pcf11 was in [26].

Conclusion
In this paper, we have introduced a method named "LRSDec" to identify gene modules and cross talks between them in the genetic interaction network. LRSDec is based on low-rank approximation with regularization parameters and nearly optimal error bounds. We developed LRSDec to estimate the low-rank part L and the sparse part S of the original matrix X.
In the synthetic data, LRSDec performed better than another matrix decomposition algorithm "GoDec, " which has been shown to be other existing decomposition algorithms [13]. Then we applied our algorithm to a genetic interaction dataset to identify modules and cross talks between them. After the decomposition, subsequent analysis revealed many novel and biologically meaningful connections. Moreover, LRSDec could impute missing data while decomposing, which could not be accomplished by other decomposition algorithms. Actually, LRSDec will not be limited by the yeast genetic interaction data. As long as the dataset has internal lowrank structure and some sparse information, we can use the LRSDec algorithm to decompose the data matrix into addition of two matrixes and then analyze them separately. This algorithm could be used widely in the field of genetic interaction data analysis, image processing, and so on. We also had a try on the genetic interaction data of C. elegans in the Supplementary Material.