Variable Selection and Joint Estimation of Mean and Covariance Models with an Application to eQTL Data

In genomic data analysis, it is commonplace that underlying regulatory relationship over multiple genes is hardly ascertained due to unknown genetic complexity and epigenetic regulations. In this paper, we consider a joint mean and constant covariance model (JMCCM) that elucidates conditional dependent structures of genes with controlling for potential genotype perturbations. To this end, the modified Cholesky decomposition is utilized to parametrize entries of a precision matrix. The JMCCM maximizes the likelihood function to estimate parameters involved in the model. We also develop a variable selection algorithm that selects explanatory variables and Cholesky factors by exploiting the combination of the GCV and BIC as benchmarks, together with Rao and Wald statistics. Importantly, we notice that sparse estimation of a precision matrix (or equivalently gene network) is effectively achieved via the proposed variable selection scheme and contributes to exploring significant hub genes shown to be concordant to a priori biological evidence. In simulation studies, we confirm that our model selection efficiently identifies the true underlying networks. With an application to miRNA and SNPs data from yeast (a.k.a. eQTL data), we demonstrate that constructed gene networks reproduce validated biological and clinical knowledge with regard to various pathways including the cell cycle pathway.


Introduction
Generally, joint estimation of mean and covariance has been developed to address problems related to biomedical data. In longitudinal data analysis, for instance, identifying correct correlation structures within each subject is a major focus and many studies come up with variants of joint mean and covariance estimation to enhance statistical efficiency (see Ye and Pan [1] and references therein). With regard to graphical models, particularly, conditional Gaussian graphical models (cGGM) aiming to elucidate conditional dependence structures subject to the mean have increasingly received much attention [2][3][4][5] and conceptually can also be viewed as examples of joint mean and covariance estimation in essence. On a methodological side, joint estimation has been found to be practically applicable in the sense that applications of joint estimation methods cover bioinformatics, such as hormone and transcriptome [1,4,[6][7][8], and stock price analysis [9]. A majority of existing methods involve variable selection of covariates for the mean vector and sparsity of covariance mainly based on the penalization methods. However, in this paper, we purposely focus on variable selection and joint estimation with a pure implementation of the likelihoodbased method, unlike common penalization techniques, to better analyze eQTL data in light of genetic feature selection.
Over the decades, genomics research has focused on comprehensive understanding of regulatory networks in the context of system biology. Commonly we are interested in a gene network, which pictures the interplay among genetic factors (e.g., gene regulation and activation). Particularly, it is important to investigate how a given genotype (genetic variants) at a particular quantitative trait locus (QTL) affects measured phenotypes and traits at that locus. For instance, gene expression quantitative loci (eQTL) make use of gene expressions as quantitative traits. eQTL analysis has been widely applied to figure out the effect of genetic perturbations associated with diseases as well as to construct regulatory networks describing how genes regulate expressions of other genes [10,11]. More precisely, the location of single nucleotide polymorphisms (SNPs) may affect multiple gene expression 2 Computational and Mathematical Methods in Medicine levels, and this accidentally causes misleading inference for dependency structure among genes [4]. Many popular methods [2][3][4][5]12] have been introduced to identify gene networks, aiming at learning networks subject to perturbation effects by genetic variants on the basis of population gene expression and genotype data.
Yin and Li [4] proposed the sparse conditional graphical Gaussian model with ℓ 1 and the adaptive lasso penalty function. Li et al. [3] suggested the two-stage estimation framework: (1) estimating a nonsparse conditional covariance matrix of genes based on a conditional variance operator between the reproducing kernel Hilbert spaces of marker genes and then (2) using 1 and adaptive lasso penalty to obtain sparse estimates of a precision matrix under the cGGM. Cai et al. [12] studied the covariate-adjusted precision matrix estimation (CAPME) method using the constrained ℓ 1 optimization.
While most recent studies encourage sparsity estimation of a precision matrix based on the penalized likelihood, we instead rely on classical variable methods based on the standard likelihood. Strictly speaking, the penalized likelihood, according to its definition, cannot be viewed as the likelihood. It naturally poses a question whether the likelihoodbased method performs better than the penalized likelihood approaches. To address this question, we consider a joint mean and constant covariance model (JMCCM) inspired by Pourahmadi [13] and propose methods for variable selection which effectively identify sparse conditional gene networks and covariates relevant to gene regulations. We employ the modified Cholesky decomposition to guarantee positive definiteness of an estimated precision matrix [13]. With this reparametrization of the precision matrix, the log-likelihood function corresponding to our model can be decomposed into an additive form of each response in terms of Cholesky parameters. This facilitates a coordinate descent type implementation we adopt for the precision matrix estimation. The combination of both generalized cross-validation (GCV) and Bayesian information criterion (BIC) performs variable selection as a benchmark. Rao and Wald statistics are also utilized to add and delete genetic markers for each gene expression and the Cholesky factors for the precision matrix. To the best of our knowledge, only a few works use Rao and Wald statistics for variable selection in the joint estimation problem, particularly with an application to eQTL analysis.
In simulation studies, extensive simulation scenarios experimentally confirm improved estimation efficiency and precise variable selection of the proposed method. For real data applications, we perform eQTL analysis via the JMCCM with gene expressions and SNPs of yeast data, in pursuit of detecting the effects of variant perturbations and underlying gene regulations. We find that the JMCCM effectively uncovers biological pathways that may potentially account for known biological processes. Taken together, the JMCCM is shown to be effective in identifying conditional dependence structures among variables compared to the existing graphical models with penalization methods such as the sparse cGGM [5] and CAPME [12].
The paper is outlined as follows. In Section 2, we describe our JMCCM with modified Cholesky decomposition and maximum likelihood estimates. The variable selection algorithm using Rao and Wald statistics for SNPs and Cholesky factors is explained in Section 3. Section 4 deals with simulation studies to demonstrate performance of variable selection and estimation of the proposed model. The yeast cell cycle pathway genes with SNPs data analysis are presented in Section 5. Concluding remarks and discussion in Section 6 are followed by the Appendix, which provides mathematical details of our method.

JMCCM with the Modified Cholesky Decomposition.
Contrary to previous methods [3,4], the JMCCM primarily aims at simultaneous estimation over the mean and precision matrix of the Gaussian graphical model. Suppose that ( , ) is a pair of the × 1 vector of genetic markers and the × 1 random vector of expression levels. Let denote the vector of SNPs and let denote the gene expression traits following -variate multivariate normal distribution with the mean ( ; ) and covariance matrix Σ as follows: where the th entry of ( ; ) is ⊤ for = 1, . . . , , and = ( 1 , . . . , ) is the × 1 linear regression coefficient vector indicating effects of SNPs perturbations to gene expressions and = ( 1 , . . . , ). Importantly, note that Σ does not depend on . The coefficient is assumed to be sparse, since each gene is known to have only a few genetic regulators according to Cai et al. [12]. The precision matrix represents a gene network (graph), as in an undirected GGM, by corresponding the nonzero ( , )th element of the precision matrix with an edge between two vertices and [14]. This edge represents conditional dependence of genes and given all other gene expression levels. Thus, our goal is to identify nonzero entries of the precision matrix in order to construct a conditional dependence genetic network after the effects of SNPs perturbations are removed. The precision matrix Σ −1 is also expected to be sparse [4].
One of our primary interests is to estimate the precision matrix Σ −1 , which is symmetrically positive definite, so we need to ensure that the estimate of the precision matrix also satisfies symmetrical positive definiteness. To this end, we apply the modified Cholesky decomposition [13] to the precision matrix, denoted by K, as follows: where C is an upper triangular matrix with diagonal entries 1 and above-diagonal elements consisting of negative of = ( 12 , . . . , 1 , 23 , . . . , 2 , . . . , −1, ) and D is a diagonal matrix containing = ( 1 , . . . , ) with > 0 for all = 1, . . . , as diagonal entries. Here, the superscript "⊤" denotes the transpose of a matrix. Positive definiteness of K is shown in Appendix A. Throughout this paper, a vector in the parenthesis is considered as a column vector. Let = ( , +1 , . . . , ) for = 1, . . . , − 1. Then we can write = ( 1 , . . . , −1 ). The parameter space for JMCCM is defined by Θ = { = ( , , ) : ∈ R , ∈ R ( −1)/2 and ∈ R + }, where R + represents the set of -dimensional vectors of positive real numbers.

Consecutive Variable Selection Algorithm
As mentioned above, and K are commonly believed to be sparse in genomic data analysis. To address sparsity, the lasso-type penalty imposed on both regression coefficients and precision matrix has been popularly applied to diverse graphical models [4,5].
Stepping aside the lassotype approach, we develop a variable selection technique that mainly relies on the combination of classical variable selection methods. Generally, the numbers of SNPs and genes tend to be considerably huge so that computational costs normally become prohibitive. In order to address this problem, the proposed variable selection algorithm proceeds with largely two stages: (1) preliminary variable selection for mean and precision matrix and (2) secondary variable selection in the middle of the joint model estimation. It is important to note that the first stage leads possible variables to be limited in scope (i.e., working parameters in the model) in order to circumvent high computational complexity.

Preliminary Variable Selection.
Preliminary variable selection is largely twofold: variable selection for the mean part and covariance part. The idea behind that is to add variables (or equivalently parameters) to the joint model with the maximum Rao statistic and to delete ones with the minimum Wald statistic. You may refer to Koo [15] for the basis selection method or Kooperberg et al. [16] that explain variable selection schemes based on Rao and Wald statistics.

Mean Part.
When it comes to the mean part, we carry out selecting predictor variables (i.e., SNPs) for each response variable (i.e., gene expression), dealing with a univariate multiple regression problem. In the addition stage, we start off with a model including only an intercept term. The MLE is used as estimator for and maximum Rao statistic (Rao [17]) is the criterion for adding a predictor together with GCV (Friedman [18]) as a stopping rule. In the deletion stage, Wald statistic is calculated to exclude predictor variables such that  Computê0 and GCV. (4) while GCV decreases do (5) Among predictors not in the current model, add a predictor having the maximum Rao statistic. (6) Updatê( ) using (5) and compute GCV.
end while (8) while GCV decreases do (9) Among predictors in the current model, delete a predictor having the minimum Wald statistic. (10) Updatê( ) using (5) and compute GCV. (11) end while (12) The optimal model is chosen by the minimum GCV.

Covariance Part.
The rationale behind variable selection in the precision matrix estimation is that each , = 1, . . . , − 1, is regressed on +1 , . . . , , and the regression coefficients are negative of off-diagonal entries of C( ), the by-product of the modified Cholesky decomposition (2) [13,19,20]. Clearly, this is one of the major benefits of reparametrization via Cholesky decomposition [13] in pursuit of improved interpretation. Subsequent to selection for predictor variables, we compute (̂( )). Given initial̂( 0) = 0, we start with computinĝ for = 1, . . . , as in (8) using (̂( )) obtained from Algorithm 1. Then compute Rao statistic to add one variable out of +1 . . . , to the current model that builds on the response variable . We repeatedly update for , where = 1, . . . , − 1. Next, we choose one variable, say , for = 1, . . . , − 1 and = + 1, . . . , , such that Rao statistic is maximized and thereby updatêand̂, respectively. When calculatinĝand̂, the BIC is computed, and the addition process stops if the BIC no longer decreases. At the completion of addition, we build a model with all selected variables as a full model and subsequently begin deletion from a full model. Deletion process is similar to the addition process except that the minimum Wald statistic is used for deletion in place of the maximum Rao statistic. Successive deletion continues until the BIC stops decreasing. In the last stage, the final model is also selected by the BIC. Details are presented in Algorithm 2. Once Algorithm 2 is finished, the number of included in the joint model no longer increases, while variable reduction could occur in the joint estimation (Algorithm 3, Section 3.2). Sparsity ofK is achieved through this variable selection scheme along with Algorithm 3.

Secondary Variable Selection in the Middle of the Joint Model Estimation.
With variables (i.e., parameters) selected by the preliminary stage above, we implement simultaneous parameter estimation for both mean and precision in the joint model and additional variable selection. OnceK is computed via Algorithm 2 with fixed̂( ), we start joint estimation by updatinĝ( ), which is formed with weight least square estimates as in (5) and weights coming fromK. Then̂and , ultimatelyK, are newly computed using updated (̂( )), and again this updatedK serves as a weight for updatinĝ ( ). Over the updates, the interplay between (̂( )) and K continues until the log-likelihood function converges.
Afterwards, deletion to current parameters begins by Wald statistic, excluding one parameter with the minimum Wald statistic. Finally, the BIC is used as a stopping rule for deletion and selection to finalize model estimation. Algorithm 3 contains details about this procedure. While the preliminary variable selection works for each mean and covariance part under the assumption of̂( ) orK are fixed, this joint estimation procedure is designed to improve estimation and selection accuracy by reflecting the changes of̂( ) andK.

Experimental Studies
In order to assess the performance of our proposed method, we carry out experimental simulations and compare the sparse conditional Gaussian graphical model (SCGGM), Zhang and Kim [5], covariate-adjusted precision matrix estimation (CAPME), Cai et al. [12] and joint model with lasso penalty (JML), and Jhong et al. [21], all of which are based on penalized likelihood approaches. The competing methods are run with their default setting regarding tuning parameters. To evaluate similarity, the estimated precision Computational and Mathematical Methods in Medicine
else (9) Set a linear regression model with response and predictors 's whose corresponding coefficients are already included in the joint model. Here ∈ { + 1, . . . , }. Compute Rao statistic for adding one predictor among +1 , . . . , whose corresponding 's are not in this linear model. (10) end if (11) end for (12) Among all 's not in the joint model, add one with the maximum Rao value. Denote this by for = 1, . . . , − 1 and = + 1, . . . , .
Updatê,̂using (8) as well asK and compute BIC. (14) end while (15) while BIC decreases do (16) Compute Wald statistics for all in the current model.
Delete one with the minimum Wald.
Updatê,̂using (8) as well asK and compute BIC. (19) end while (20) The optimal model is chosen by the minimum BIC.
Algorithm 2: Variable selection in the precision matrix estimation.
Updatê,̂andK using (̂( )) obtained from the previous step. (4) until the log-likelihood function converges. (5) while BIC decreases do (6) Compute Wald statistic of deleting each and in the current model.
Updatê,̂andK using (̂( )) obtained from the previous step. matrix and true matrix are benchmarked by the Steins loss function: whereK is an estimate of the true precision matrix K and | ⋅ | denotes the determinant of a matrix. The Frobenius norm of difference between K andK, denoted by ‖Δ‖ , where Δ = K −K, is also considered. In addition to the Steins loss function, to measure how efficiently our model recovers the true conditional dependent relationship among genes, specificity (SPE) and sensitivity (SEN) are used, as defined by where TN, TP, FN, and FP are the numbers of true negatives, true positives, false negatives, and false positives with regard to off-diagonal elements of a precision matrix. Here, we treat a nonzero entry of a precision matrix as "positive." To combine sensitivity and specificity, Youden's index (=SPE + SEN − 1) is   Inspired by Yin and Li [4], we generate simulation data sets in the form of eQTL data sets such that nonzero entries of a precision matrix, commonly called a link (or edge), are randomly assigned with probability 1 / , where is the number of genes and 1 is some positive constant. For a link generated at the ( , )th entry of the precision matrix, denoted by k , the corresponding element is sampled from the uniform distribution over [−1, 0.5]∪[0.5, 1]. For each row, off-diagonal elements are divided by the sum of their absolute values multiplied by 1.5. And we obtain the true precision matrix K by symmetrizing and setting diagonals as 1. To create the × regression coefficients matrix , we first generate a × indicator matrix that has 1 as entry with probability 2 / for some positive constant 2 . If the ( , )th element of this indicator matrix is 1, is randomly generated from ([ , 1] ∪ [−1, − ]), where is the smallest absolute value of K generated.
The small-scale simulation results in Table 3 suggest that the JMCCM produces better estimates than all other methods across all six models. Due to computational issues, we drop some results of JML from Table 3. By comparing the results of model 3 with 6 and 2 with 4, when = 1000 and the number of genes is fixed, we can see that JMCCM and CAPME show less changes in and ‖Δ‖ than they appear in the SCGGM as the number of SNPs increases. This indicates that JMCCM and CAPME are less subject to the increment of the number of SNPs than SCGGM, when modest size of genes is involved. Estimation performance of CAPME seems to be affected by the number of genes more easily than JMCCM because Stein loss and Frobenius norm of CAPME for models 1, 2, and 3 with fixed increase more rapidly than those of JMCCM do. For identifying structures of the precision matrix, JMCCM surpasses SCGGM in discovering nonzero elements (higher sensitivity) as complexity of the model rises. This is possibly due to the fact that SCGGM tends to produce sparse estimates more than the true precision matrix. Higher sensitivity implies that our proposed model is less likely to miss the influential conditional dependency among genes. While CAPME and JML score near 1 in sensitivity, they mark poor number in specificity because 5-fold cross-validation for CAPME and validation approach for JML with default setting for tuning parameters selection tend to choose small ones, leading to dense estimates. JMCCM produces higher Youden's index across all simulation scenarios compared to all other methods, and the performance gap between JMCCM and others is increasingly widened as the models increase in sample size and complexity. The results for the large-scale simulations (models 7-9) are summarized in Table 4 and  Table S3. Due to long computing time, the results of CAPME for models 8 and 9 are not reported. Overall, the results are consistent with the small-scale simulations. The gap between JMCCM and SCGGM in estimation performance widened as the complexity of the model increases (Table 4).
We also simulate SNPs to mimic the linkage disequilibrium (LD) which is known to be a common phenomenon in DNA sequence and to assess the performance of our approach. We randomly generate two groups of SNPs which lie on LD, each of which contains LD block including 10 correlated SNPs such that correlation is greater than 0.9. Together with these SNPs, 10 SNPs are generated independently from Bernoulli trials with probability of 0.5 for a total of 20 SNPs and 20 genes: = 20 and = 20. Table 5 presents the simulation results using these data of sample size = 500 and = 1000 with 50 repetitions. Compared to the results of model 4 in Table 3, which is based on 20 independent SNPs with = 500, LD is found to have little impact on all methods, except for slight decreases in Stein loss and Frobenius norm of SCGGM.

eQTL Analysis of Yeast Data
In this section, we apply the proposed algorithm to genomic eQTL (i.e., expression quantitative trait loci) data in order to examine whether the proposed method effectively recovers true dependency of gene expressions, which builds on known molecular mechanisms. To this end, we collect a set of yeast data [22], including polymorphic genotypes and mRNA expressions. The yeast data have been widely applied to elucidate the biological interactions between nucleotide polymorphisms and their responding genes (e.g., perturbation effects) [23][24][25][26]. Thus, our primary goal is to identify conditional dependency among genes with an adjustment of SNPs perturbations to each gene expression level.
The data sets are collected for two yeast parent strains, BY4716 (BY) and RM11-1a (RM), and their 112 segregants. We obtain SNPs for 1,260 loci to the exclusion of the redundant SNPs observed in neighboring genetic regions and leave 3,684 expression genes after screening out genes of missing more than 5%. In order to validate whether or     not the estimated gene network is consistent with unknown biological knowledge, we take the true signaling pathway for comparison. Out of 3,684 yeast genes, we purposely focus on a set of 64 genes that are ascertained in the cell cycle yeast pathway available in the KEGG database [24]. Together with 1,260 SNPs as predictors, we construct a gene network model of 64 genes. JMCCM selects 222 nonzero elements of the precision matrix and leaves nonzero regression coefficients, for a total of 111 links among genes. A total of 489 regression coefficients of SNPs over gene expression levels are included in the final model. Figure 2 displays the conditional gene network estimated by the proposed joint model. A pair of two genes is linked if an off-diagonal element of the estimated precision matrix is nonzero, and so 51 genes are shown to be linked and assemble in a module. Given this estimated gene network, we notice that the gene network is found to be concordant with the true cell cycle pathway. For instance, according to the true cell cycle pathway in Figure 1, MCM3 is linked to ORC3, ORC5, MCM7, and MCM4. MCM5 is connected to TAH11 and ORC1. PHO4 is closely linked to CLN2 and SIC1, whose molecular function is related to MAP kinase orthologs (i.e., MAPK pathway   With regard to genetic variant effects, JMCCM is shown to effectively identify some of the well-known direct genetic perturbations. Gene expressions are regulated by some genetic variants, which, unless otherwise taken into account, may falsely capture the interplay of genes in a network. Our founding includes Clb-specific Cdk inhibitor (SIC1) as influencing the molecular interface between cyclin-Cdk complexes (e.g., binding to and blocking Cdk1/Clb activity, ultimately to maneuver the timing of DNA replication (see Table S1 in Supplementary Materials; [27]). In addition, our gene networks demonstrate that SIC1 is strongly perturbed by CLB3, while SCGGM did not detect any perturbation effects. More interestingly, many previous experiments validated the association between SIC1 and CLB3 to uncover the underlying mechanism of the cell cycle [28][29][30]. Clearly, this implies that JMCCM does a better job in accounting for SNPs perturbation as compared to SCGGM.
Moreover, the gene module detected from JMCCM is shown in Table 6. Pertaining to these gene modules, we hypothesize if gene modules, each containing hub genes and their neighboring genes, are enriched with common biological processes or not. To test this hypothesis, we conduct the gene ontology (GO) enrichment analysis [31] over the detected gene network modules (see Table S2 in Supplementary Materials) from both JMCCM and SCGGM using Fisher's exact tests. Table 6 demonstrates that the proposed method outstandingly performs detecting modules biologically associated with many molecular processes, while none is detected by SCGGM. More importantly, the pathway of the organic acid metabolic process enriched in gene module 1 is also reported and validated on the basis of the network modules constructed with large-scale integration of yeast data in Zhu et al. [32].
Putting all things together, we conclude that the proposed method facilitates recovering the true SNPs perturbations in the midst of gene regulations and elucidating the underlying interplay of gene interactions. These fortes of the proposed algorithm are favored for reinforcing a priori biological knowledge and address a novel hypothesis related to clinical and translational potential.

Conclusion and Discussion
In this paper, we propose JMCCM to efficiently identify conditional dependent structures of gene expressions with adjustments to perturbation effects of SNPs. Contrary to the existing conditional graphical models, the precision matrix commonly used to reveal the true relationship among genes is parameterized via the modified Cholesky decomposition. The maximum likelihood estimates of the precision matrix were computed, while variable selection of SNPs and Cholesky factors are carried out separately and jointly by the GCV and BIC criterion. From experimental studies, it is clearly shown that JMCCM performs better than the existing penalization methods. Besides, JMCCM in the application to yeast cell cycle data successfully recovers many parts of the cell cycle pathway with adjustments of SNPs to each gene expressions level. Notably, the model entails the estimation of precision matrix, of which components are assumed to be constant. So, in the future, we may relax this somewhat strong assumption in the way that the model can parametrize over and in pursuit of more accurate estimation [13]. We leave this for next study.

A. Positive Definiteness of K
The matrix representation of the modified Cholesky decomposition (2)   Since > 0 for all = 1, . . . , , ∑ =1 ((C ) ) 2 = 0 if and only if C = 0, which cannot happen here because is a nonzero vector. Thus K is a positive definite matrix.

B. A Derivation of the Log-Likelihood Function
The probability density function of (1) is given by Denote the score function of ( ) by ( ( )). Observe that By solving the normal equation ( ( )) = 0, we have (5).