Reverse Engineering Sparse Gene Regulatory Networks Using Cubature Kalman Filter and Compressed Sensing

This paper proposes a novel algorithm for inferring gene regulatory networks which makes use of cubature Kalman filter (CKF) and Kalman filter (KF) techniques in conjunction with compressed sensing methods. The gene network is described using a state-space model. A nonlinear model for the evolution of gene expression is considered, while the gene expression data is assumed to follow a linear Gaussian model. The hidden states are estimated using CKF. The system parameters are modeled as a Gauss-Markov process and are estimated using compressed sensing-based KF. These parameters provide insight into the regulatory relations among the genes. The Cramér-Rao lower bound of the parameter estimates is calculated for the system model and used as a benchmark to assess the estimation accuracy. The proposed algorithm is evaluated rigorously using synthetic data in different scenarios which include different number of genes and varying number of sample points. In addition, the algorithm is tested on the DREAM4 in silico data sets as well as the in vivo data sets from IRMA network. The proposed algorithm shows superior performance in terms of accuracy, robustness, and scalability.


Introduction
Gene regulation is one of the most intriguing processes taking place in living cells. With hundreds of thousands of genes at their disposal, cells must decide which genes are to express at a particular time. As the cell development evolves, different needs and functions entail an efficient mechanism to turn the required genes on while leaving the others off. Cells can also activate new genes to respond effectively to environmental changes and perform specific roles. The knowledge of which gene triggers a particular genetic condition can help us ward off the potential harmful effects by switching that gene off. For instance, cancer can be controlled by deactivating the genes that cause it.
Gene expression is the process of generating functional gene products, for example, mRNA and protein. The level of gene functionality can be measured using microarrays or gene chips to produce the gene expression data [1]. More accurate estimation of gene expression is now possible using the RNA-Seq method. Intelligent use of such data can help improve our understanding of how the genes are interacting in a living organism [2][3][4]. Gene regulation is known to exhibit several modes; a couple of important ones include transcription regulation and posttranscription regulation [5]. While the theoretical applications of gene regulation are extremely promising, it requires a thorough understanding of this complex process. Different genes may cooperate to produce a particular reaction, while a gene may repress another gene as well. The potential benefits of gene regulation can only be reaped if a complete and accurate picture of genetic interactions is available. A network specifying different interconnections of genes can go a long way in understanding the gene regulation mechanism. The control and interaction of genes can be described through a gene regulatory network. Such a network depicts various interdependencies among genes where nodes of the network represent the genes, 2 Advances in Bioinformatics and the edges between them correspond to an interaction among them. The strength of these interactions represents the extent to which a gene is affected by other genes in the network. A key ingredient of this approach is an accurate and representative modeling of gene networks. Precise modeling of a regulatory network coupled with efficient inference and intervention algorithms can help in devising personalized medicines and cures for genetic diseases [6].
Various methods for gene network modeling have been proposed recently in the literature with varying degrees of sophistication [7][8][9][10]. These techniques can be broadly classified as static and dynamic modeling schemes. Static modeling includes the use of correlation, statistical independence for clustering [11][12][13], and information theoretic criteria [14][15][16]. On the other hand, dynamic models provide an insight into the temporal evolution of gene expressions and hence yield a more quantitative prediction on gene network behavior [17][18][19][20]. In order to incorporate the stochasticity of gene expressions, statistical techniques have been applied [13]. A rich literature is also available on the Bayesian modeling of gene networks [21][22][23][24][25][26]. Promoted in part by the Bayesian methods, the state-space approach is a popular technique to model the gene networks [27][28][29][30][31][32][33], whereby the hidden states can be estimated using the Kalman filter. In the case of nonlinear functions, the extended Kalman filter (EKF) and particle filter represent feasible approaches [33,34]. However, the EKF relies on the first-order linear approximations of nonlinearities, while the particle filter may be computationally too complex. A comprehensive review of these methods can be found in [35].
In this paper, the gene network is modeled using a statespace approach, and the cubature Kalman filter (CKF) is used to estimate the hidden states of the nonlinear model [36,37]. The gene expressions are assumed to evolve following a sigmoid squash function, whereas a linear function is considered for the expression data. The noise is assumed to be Gaussian for both the state evolution and gene expression measurements. As the gene network is assumed sparse, any simple mean square error minimization technique will not suffice for the estimation of static parameters. Therefore, a compressed sensing-based Kalman filter (CSKF) [38] is used in conjunction with CKF for reliable estimation of parameters. In case of statistical inference, it is essential to obtain some guarantees on the performance of estimators. In this regard, the Cramér-Rao lower bound (CRB) of the parameter estimates is used as a benchmarking index to assess the mean square error (MSE) performance of the proposed estimator which is evaluated here for a parameter vector. The performance of the proposed algorithm is tested on synthetically generated random Boolean networks in various scenarios. The algorithm is also tested using DREAM4 data sets and IRMA networks [39,40].
The main contributions of this paper can be summarized as follows.
(1) CKF is proposed for the estimation of states, and a compressed sensing-based Kalman filter is used for the estimation of system parameters. The genes are known to interact with few other genes only necessitating the use of sparsity constraint for more accurate estimation. The proposed algorithm carries out online estimation of parameters and is therefore computationally efficient and is particularly suitable for large gene networks.
(2) The Cramér-Rao lower bound is calculated for the estimation of unknown parameters of the system. The performance of the proposed algorithm is compared to CRB. This comparison is significant as it shows room for improvement in the estimation of parameters.
(3) The proposed algorithm is compared with the EKF algorithm. Using the false alarm errors, true connections, and Hamming distance as fidelity criteria, rigorous simulations are carried out to assess the performance of the algorithm with the increase in the number of samples. In addition, receiver operating characteristic (ROC) curves are plotted to evaluate the algorithms for different network sizes. It is observed that the proposed algorithm outperforms EKF in terms of accuracy and precision. The proposed algorithm is then applied to the DREAM4 10-gene and 100gene data sets to assess the algorithm accuracy. The underlying gene network for the IRMA data sets is also inferred.
The rest of this paper is organized as follows. Section 2 describes the underlying system model for the gene expressions. The proposed CKF algorithm in combination with CSKF for gene network inference is formulated in Section 3. The derivation of CRB is shown in Section 4, and the simulation results and their interpretation are presented in Section 5. Finally, conclusions are drawn in Section 6.

System Model
Gene regulatory networks can be modeled as static or dynamical systems. In this work, state-space modeling is considered which is an instance of a dynamic modeling approach and can effectively cope with time variations. The states represent gene expressions, and their evolution in time, in general, can be expressed as where is the total number of data points available, w is assumed to be a zero-mean Gaussian random variable with covariance Q = 2 I, and the function (⋅) represents the regulatory relationship between the genes and is generally nonlinear. The microarray data is a set of noisy observations and is commonly expressed as a linear Gaussian model [41] where k is Gaussian-distributed random variable with zero mean and covariance S = 2 V I and incorporates the uncertainty in the microarray experiments. In order to capture Advances in Bioinformatics 3 the gene interactions effectively, the following nonlinear state evolution model is assumed [33,34]: where is the total number of genes in the network and (⋅) is the sigmoid squash function This particular choice for the nonlinear function ensures that the conditional distribution of the states remains Gaussian [41]. The multiplicative constants quantify the positive or negative relations between various genes in the network. A positive value of implies that the th gene is activating the th gene, whereas a negative value implies repression [34,42]. The absolute value of these parameters indicates the strength of interaction.
The model given in (3) and (4) in the absence of any constraints may be unidentifiable and may result into overfitted solutions [43]. Assumptions on network structures are, therefore, necessary to obtain a connectivity matrix that agrees with the biological knowledge. In a gene regulatory network (GRN), the genes are known to interact with few other genes only. To this end, the coefficients s are estimated using sparsity constraints, as explained in the next section.
A discrete linear Gaussian model for the microarray data is considered which can be expressed at the th time instant as [41] Stacking the unknown parameters together, the parameter vector to be estimated is where Plugging the values of states from (3) into (5), it follows that where Thus, the gene network inference problem boils down to the estimation of system parameters b using the observations y , where the effective noise e is the sum of system and observation noises. The next section describes the proposed inference algorithm for sparse networks.

Method
In this section, the methodology proposed to infer the system parameters in (3) is described. The proposed cubature Kalman filter with sparsity constraints (CKFS) approach is succinctly illustrated in Figure 1. The specific details of this algorithm are as next presented.

Cubature Kalman Filter.
Kalman filter is a Bayesian filter which provides the optimal solution to a general linear state space inference problem depicted by (1) and (2) and assumes a recursive predictive-update process. The underlying assumption of Gaussianity for the predictive and the likelihood densities simplifies the Kalman filter algorithm to a two-step process, consisting of prediction and update of the mean and covariance of the hidden states. However, the presence of nonlinear functions in the state and measurement equations requires calculation of multidimensional integrals of the form nonlinear function × Gaussian density [36], which in general is computationally prohibitive. Several solutions to this problem have been proposed including the EKF, which linearizes the nonlinear function by taking its first-order Taylor approximation, and the unscented Kalman filter (UKF), which approximates the probability density function (PDF) using a nonlinear transformation of the random variable. Recently, a new approach, CKF, has been proposed which evaluates the integrals numerically using spherical-radial cubature rules [36].
The next two subsections briefly explain the working of Bayesian filtering and the CKF solution for the nonlinear multidimensional integrals. Advances in Bioinformatics the mean and covariance of the Gaussian posterior density are computed as follows: where denotes the expectation operator and x −1 is normally distributed with parameters (x −1| −1 , P , −1| −1 ). The third equality is a consequence of the zero-mean nature of Gaussian noise w and its independence from d . The estimatesx −1| −1 and P , −1| −1 are assumed to be available from the previous iteration. Here, P , | −1 is an estimate of the error covariance matrix.

Measurement Update.
Since the measurement noise is also Gaussian, the likelihood density is given by . As the measurements become available at the th time instant, the mean and covariance of the likelihood density are calculated as follows: The updated posterior density, obtained from the conditional joint density of states, and the measurements can be expressed as where is the cross-covariance matrix between the states and the measurements. Hence, the states and the corresponding error covariance matrix are updated by calculating the innovation z −ẑ | −1 and the Kalman gain K , The next subsection briefly describes the computation of high-dimensional integrals present in the equations above.

Computation of Integrals Using Spherical-Radial Cubature Points.
In order to determine the expectations in (10), using a numerical integration method, a spherical-radial cubature rule is applied. This method calculates the cubature points X , −1| −1 as follows [36]: where = √ℓ/2[1] , = 1, . . . , ℓ, ℓ = 2 denotes the total number of cubature points and U −1| −1 stands for the square root of the error covariance matrix; that is, The cubature points are updated via the state equation The propagated cubature points yield the state and error covariance estimatesx The integrals in (11) and (14) can be evaluated in a similar manner. The next subsection explains the estimation of parameters in the system.

Estimation of Sparse Parameters Using Kalman
Filter. The state estimates are obtained using the CKF as described in the previous subsection. In order to estimate the unknown parameters in the system model, one of the most commonly used methods involves stacking the parameters with the states and estimating them together. The estimation process performed in this manner is called joint estimation. Another method for the estimation of parameters consists of a twostep recursive process which is termed dual estimation. This process estimates the states in the first step, and with the assumption that states are known, parameters are estimated in the second step. These steps are repeated until the algorithm converges to the true values or until the amount of available observations is exhausted. This paper makes use of the latter technique. The vector b as defined in (6) is assumed to be evolving as a Gauss-Markov model. As discussed previously, the states are assumed to be known at this step. The system evolution equations can therefore be expressed as where stands for the i.i.d Gaussian noise and R is as defined in (8). It is observed that (19) is a system of linear equations with additive Gaussian noise, and therefore, the Kalman filter is the optimal choice for the estimation of Advances in Bioinformatics 5 parameter vector. The standard predict and update steps involved in Kalman filter are summarized as follows: where K denotes the Kalman gain and P represents the error covariance matrix.
The Kalman filter algorithm is based on an 2 -norm minimization criterion. As the gene networks are known to be highly sparse, the parameter vector is expected to have only a few nonzero values. A more accurate approach for estimating such a vector would be to introduce an additional constraint on its 1 -norm which is the core idea in compressed sensing [38,44]. Such an 1 -norm constraint provides a unique solution to the underdetermined set of equations [45]. Therefore, instead of a simple 2 norm minimization, the following constrained optimization problem is considered: The importance of this constraint can be judged by the fact that without it, the system would be rendered unidentifiable [43]. The problem (21) can be solved using a pseudomeasurement (PM) method which incorporates the inequality constraint (21) in the filtering process by assuming an artificial measurement ‖b ‖ 1 − = 0. This is concisely expressed as The value of the covariance matrix Σ = 2 I of the pseudonoise is selected in a similar manner as the process noise covariance in the EKF algorithm. However, it is found that large values of variances, that is, 2 ≥ 100, prove sufficient in most cases [38]. Further details on selecting these parameters can be found in [38,46]. The PM method solves (21) in a recursive manner for iterations using the following steps: At each th time instant, P , | andb | obtained from (20) are considered as initial values; that is,b 1 =b | and P 1 = P , | which is the error covariance matrix. The value of (1) Input time series data set y.
Update the parametersb using (23). is equal to the number of constraints, that is, the expected number of nonzero b s in the system. Possible ways for calculating include minimum description length (MDL) principle and Bayesian information criterion (BIC).

Inference Algorithm.
The network inference algorithm is summarized in Algorithm 1. The algorithm consists of a recursive process which repeats itself for the number of observations present in the time-series data. For each time sample, the state estimate is obtained using the CKF, and the parameter estimate is obtained using the KF. Since the parameters are expected to be sparse, the estimates are then refined further using the CSKF algorithm. This iterative process results in a simple and accurate algorithm for gene network inference while considering a complex nonlinear model.

Cramér-Rao Bound
The performance of an estimator can be judged by comparing it with theoretical lower bounds proposed in parameter estimation theory. The CRB establishes a lower bound on the MSE of an unbiased estimator [47]. In particular, the CRB states that the covariance matrix of the estimatorb is lower bounded by where the matrix inequality ⪰ is to be interpreted in the semidefinite sense and I(b) is the Fisher information matrix (FIM) The CRB for gene network inference can be calculated as follows. By stacking all the observations for = 1, . . . , , (7) can be written compactly in the matrix form where is a constant. The derivative of ln (y | b) can be expressed as It now follows that By taking the expectation of (29), the FIM in (25) is given by The inverse of the FIM in (30) can be used to place a lower bound on the estimation error of the parameter vector b. Figure 2 shows the comparison of MSE of CKFS algorithm with CRB as a function of number of samples for one representative gene from the eight-gene network considered in Section 5.1. It is observed that the MSE of the estimated parameters decreases with increasing number of samples.

Results and Discussion
The simulation results of the CKFS algorithm are discussed in this section. The performance is first tested on synthetic data obtained from randomly generated Boolean networks under various scenarios and performance metrics. The algorithm is then assessed on the DREAM4 networks and the IRMA network.

Synthetic Data.
Time-series data is produced from randomly generated Boolean networks using the system model (3) and (5). Two scenarios are considered for this purpose.
First, the comparison is performed by varying the number of sample size while keeping the network size fixed. The gene network consists of 8 genes and 20 vertices. In terms of network estimation, if the algorithm predicts an edge between two nodes which may not be present in reality, an error, referred to as false alarm error (F), is said to have occurred. Another situation is the indication of the absence of a vertex in the graph which in fact is present in the real network. This kind of error is termed missed detection (M). The summation of these two errors normalized over the total number of vertices in the network yields the Hamming distance. It is also important to consider the probability of predicting the true connections correctly which will be assessed by the true connections (T) metric. An algorithm with low Hamming distance and small false alarm error is particularly desirable as predicting an edge erroneously can be troublesome for biologists. True connections indicate the reliability of the predictions. Figure 3 illustrates the performance of the CKFS algorithm and that of the EKF algorithm proposed in [34] in terms of the metrics described above. It is important to mention here that the same system model is assumed by both CKFS and EKF algorithms for the purpose of this simulation. These metrics are the same as those used in [15]. The variances of both the system and measurement noises, 2 and 2 V , respectively, are taken to be 10 −5 in all the simulations and are assumed to be known. It is noticed that EKF has a slightly lower false alarm rate when the number of samples is small; however, as the number of samples increases, CKFS yields a lower false alarm error. The Hamming distance for CKFS is also smaller than EKF indicating lesser cumulative error. True connections show a consistent behavior for the two algorithms when the number of samples is increased where CKFS is able to predict connections more accurately. These experiments show the superiority of CKFS in terms of lower error rate.
To obtain a more rigorous evaluation, the performance of algorithms is then compared in a scenario which considers the sample size to be fixed and assumes networks of different sizes. The receiver operating characteristic (ROC) curves are plotted as performance measures. A higher area under the ROC curve (AUROC) shows more true positives for a given false positive, and therefore, indicates better classification. The performance of CKFS( , , ) and EKF( , , ) is shown in Figure 4, where stands for the number of nodes, Advances in Bioinformatics represents the number of edges, and denotes the time points. It is observed that the CKFS exhibits superior performance than the EKF for networks of different sizes.
The complexity of the two algorithms is compared for synthetically generated networks with number of genes equal to 10, 20, 30, and 40. The sample size is kept to 50 time points for each of these networks, and the run time for EKF and CKFS algorithms is calculated as shown in Table 1. It is noted that EKF is faster for smaller network sizes, but as the network size increases, the run time gets much larger than that for CKFS. The main reason for this is that EKF [34] estimates the states and parameters by stacking them together which requires large-sized matrix multiplications at each iteration.
The benefit associated with performing dual estimation, as in CKFS, is that the parameters are estimated separately from the states. Since the system is linear and one-to-one for parameters, inversion of much smaller matrices can be performed reducing the computational complexity of CKFS algorithm. CKFS is therefore particularly attractive for largesized networks.  (5,10,20), (10,12,20), and (15,19,20). The area under the ROC curve for CKFS is more than that for EKF for various sized networks. purpose [39,48]. In this section, the performance of the CKFS algorithm is evaluated using the 10-gene and 100gene networks released online by the DREAM4 challenge.

DREAM4 Gene
Five networks are produced using the known GRNs of Escherichia coli and Saccharomyces cerevisiae. The data sets for each of 10-gene network consists of 21 data points for five different perturbations. The inference is performed by using all the perturbations. The 100-gene network consists of data sets for ten perturbations. AUROC and area under the precision-recall curve (AUPR) are calculated for the five networks of both the data sets and shown in Tables 2 and 3, respectively. The quantities, precision and recall, are defined as = /( + ) and = /( + ), respectively. For comparison purposes, the values of the two quantities for time-series network identification (TSNI) algorithm that , and (c) Gold standard, inferred network using CKFS, and inferred network using ODE [39,40]. Black arrows indicate true connections, blue arrows indicate the edges that are correct, but their directions are reversed, and red arrows indicate false positives.  exploits ordinary differential equations are also given [39]. The CKFS algorithm is found to perform significantly better than the TSNI algorithm.

IRMA Gene
Network. In addition to synthetic data, it is imperative to test the algorithms using real biological data. In this subsection, the performance of the CKFS algorithm is assessed using the in vivo reverse-engineering and modeling assessment (IRMA) network [40]. This network consists of five genes. Galactose activates the gene expression in the network, whereas glucose deactivates it. The cells are grown in the presence of galactose and then switched to glucose to obtain the switch-off data which represents the expressive samples at 21 time points. The switch-on data consists of 16 sample points and is obtained by growing the cells in a glucose medium and then changing to galactose. The system and measurement noise variances for the CKFS are assumed to be identical as in the previous simulations. Figure 5 shows the inferred network, the gold standard, and the network inferred using TSNI. It is observed that the CKFS algorithm succeeds to predict most of the interactions while giving lower false positives.

Conclusions
This paper presents a novel algorithm for inferring gene regulatory networks from time-series data. Gene regulation is assumed to follow a nonlinear state evolution model. The parameters of the system, which indicate the inhibitory or excitatory relationships between the genes, are estimated using compressed sensing-based Kalman filtering. The sparsity constraint on the parameters is crucial because the genes are known to interact with few other genes only. The use of CKF and the dual estimation of states and parameters renders the algorithm computationally efficient. The performance of CKFS is evaluated for synthetic data for different network sizes as well as varying sample points. ROC curves, Hamming distance, and true positives are used for comparing the accuracy of inferred network with EKF. It is observed that CKFS outperforms the EKF algorithm. In addition, CKFS gives advantages over EKF in terms of smaller run time for large networks. The Cramér-Rao lower bound is also determined for the parameters of the model and compared with the MSE performance of the proposed algorithm. Assessment using DREAM4 10-gene and 100-gene networks and IRMA network data corroborates the superior performance of CKFS. Future research directions include incorporating the estimation of model order in the network inference algorithm.