We present a statistical method to rank observed genes in gene expression time series experiments according to their degree of regulation in a biological process. The ranking may be used to focus on specific genes or to select meaningful subsets of genes from which gene regulatory networks can be built. Our approach is based on a state space model that incorporates hidden regulators of gene expression. Kalman (K) smoothing and maximum (M) likelihood estimation techniques are used to derive optimal estimates of the model parameters upon which a proposed regulation criterion is based. The statistical power of the proposed algorithm is investigated, and a real data set is analyzed for the purpose of identifying regulated genes in time dependent gene expression data. This statistical approach supports the concept that meaningful biological conclusions can be drawn from gene expression time series experiments by focusing on strong regulation rather than large expression values.

Novel gene expression technologies (e.g., microarrays, next-generation sequencing, etc.) make it possible to study the simultaneous expression of an ever increasing number of genes [

To date, a variety of statistical methods have been developed for the analysis of time series microarray data, though far fewer exist for next-generation sequencing. Clustering methods have been used extensively to deduce the function of previously unknown genes by comparing their expression profiles to those of known genes in the same cluster [

Most of the existing analytic methods for discovery of gene regulatory networks rely on the preselection of a subset of genes that are subsequently analyzed. Unfortunately, these methods are only feasible if the set of genes from which the network is built is small (i.e., on the order of hundreds, not thousands). The preselection is typically justified by the underlying assumption that not all of an organism's thousands of genes are involved in a specific temporal process (e.g., the cell cycle). Under such an assumption, arbitrary cutoff values are used when comparing a gene's absolute or relative maximum expression to a control tissue (zero time point) to decide whether the gene should be included in or excluded from further consideration [

Our proposed approach, called the KM-algorithm, does not depend on any

The performance of the KM-algorithm is studied via simulation using gene expression time course data of varying size (number of genes) and length (number of time points). Although the simulations are based on microarray technology, the approach does not depend on a particular technology. The ranking obtained from our approach is evaluated by the positions of the (simulated) regulated genes in the final list. These simulation studies provide guidelines and recommendations for the minimum number of time point observations that a gene expression experiment should include in order to achieve a desired degree of statistical separation (i.e., accuracy) between regulated and unregulated genes.

A statistical model for any complex biological process such as gene regulation must make simplifying assumptions. The choice of a model is a compromise between flexibility of the model (being able to explain a large proportion of the observed variance) and simplicity of the model itself. In this work gene regulation is modeled through a discrete time state space model with hidden regulators and Gaussian error terms:
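As a rough illustration of this model class only, the following Python sketch simulates a generic linear Gaussian state space model in which a few hidden regulators drive the observed genes. The symbols `A` (regulator transition matrix), `C` (loadings of genes on regulators), `q_sd`, and `r_sd` are assumptions made for this example and need not match the authors' notation or exact model.

```python
import random

def simulate_state_space(A, C, q_sd, r_sd, n_time, seed=0):
    """Simulate x_t = A x_{t-1} + w_t and y_t = C x_t + v_t with Gaussian noise.

    A: k x k transition matrix of the hidden regulators (list of lists)
    C: p x k loading matrix linking regulators to the p observed genes
    q_sd, r_sd: standard deviations of the state and observation noise
    Returns (states, observations), each a list of length n_time.
    """
    rng = random.Random(seed)
    k, p = len(A), len(C)
    x = [rng.gauss(0.0, 1.0) for _ in range(k)]  # initial hidden state
    states, observations = [], []
    for _ in range(n_time):
        # The comprehension reads the previous x before the name is rebound.
        x = [sum(A[i][j] * x[j] for j in range(k)) + rng.gauss(0.0, q_sd)
             for i in range(k)]
        y = [sum(C[g][j] * x[j] for j in range(k)) + rng.gauss(0.0, r_sd)
             for g in range(p)]
        states.append(x)
        observations.append(y)
    return states, observations

# Two hidden regulators and four genes. The last two genes have zero
# loadings, so they are unregulated and show only observation noise.
A = [[0.9, 0.0], [0.0, -0.5]]
C = [[1.0, 0.0], [0.5, 1.0], [0.0, 0.0], [0.0, 0.0]]
states, obs = simulate_state_space(A, C, q_sd=0.3, r_sd=0.1, n_time=20)
```

In this toy setting a gene counts as regulated when its row of `C` is nonzero, which is one simple way to generate data with known regulated and unregulated genes.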

A modified EM-algorithm is employed to estimate the parameters of the state space model. The parameters that are of interest, and that need to be estimated in the state space model (

The KM-algorithm starts with random initial values for the model parameters and then alternates between the Kalman smoothing (KS) estimates of the hidden regulators
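The smoothing half of this alternation can be sketched on a scalar model with known parameters. The example below is a minimal illustration, not the authors' implementation: it assumes a single hidden regulator with x_t = a·x_{t-1} + w_t and a single observed series y_t = c·x_t + v_t, and it applies a standard Kalman filter followed by a Rauch-Tung-Striebel smoother. All parameter names are assumptions made for the example.

```python
def kalman_smooth(y, a, c, q, r, m0=0.0, p0=1.0):
    """Kalman filter plus Rauch-Tung-Striebel smoother for the scalar model
    x_t = a*x_{t-1} + w_t, w_t ~ N(0, q);  y_t = c*x_t + v_t, v_t ~ N(0, r).
    The first state has prior N(m0, p0)."""
    n = len(y)
    m_pred, p_pred, m_filt, p_filt = [], [], [], []
    # Forward pass: one-step prediction followed by the measurement update.
    for t in range(n):
        if t == 0:
            mp, pp = m0, p0
        else:
            mp = a * m_filt[-1]
            pp = a * a * p_filt[-1] + q
        gain = pp * c / (c * c * pp + r)
        m_pred.append(mp)
        p_pred.append(pp)
        m_filt.append(mp + gain * (y[t] - c * mp))
        p_filt.append((1.0 - gain * c) * pp)
    # Backward pass: revise each filtered estimate using later observations.
    m_smooth, p_smooth = m_filt[:], p_filt[:]
    for t in range(n - 2, -1, -1):
        g = p_filt[t] * a / p_pred[t + 1]
        m_smooth[t] = m_filt[t] + g * (m_smooth[t + 1] - m_pred[t + 1])
        p_smooth[t] = p_filt[t] + g * g * (p_smooth[t + 1] - p_pred[t + 1])
    return m_filt, p_filt, m_smooth, p_smooth

y = [0.9, 1.1, 1.0, 0.8, 1.2]
m_filt, p_filt, m_smooth, p_smooth = kalman_smooth(y, a=1.0, c=1.0, q=0.1, r=0.5)
```

In an EM-type scheme, smoothed moments of this kind would then be used to update the parameters (here `a`, `c`, `q`, `r`), and the two steps would be repeated until the estimates stabilize.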

To update the model parameters, the likelihood function (

Based on the state space model (

In the state space model (

Before the generalized EM-algorithm can be employed for estimation of the model parameters, the dimension of the state space must be estimated. Conventional model selection methods such as the Bayesian Information Criterion (BIC) [

Since traditional model selection criteria are not well suited for this particular application, a method that is based on the autocovariances of the observed gene expression values is employed [

In practice the observed gene expression values,

A further advantage of using a method based on the autocovariances of the observed gene expressions, rather than conventional model selection procedures, is that it does not require fitting many models of different dimensions. Instead, the estimated autocovariances of the observed variables are computed directly from the observations. Since the model fitting step is computationally much more time intensive than the singular value decomposition of
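The first stage of such a procedure, computing sample autocovariances and arranging them in a block Hankel matrix, can be sketched in plain Python as follows. The layout (block (i, j) holding the lag i+j+1 autocovariance) follows a common convention and is an assumption, as are the function names. The singular value decomposition itself is left to a linear algebra library and is not shown.

```python
def autocovariance(series, lag):
    """Sample lag-h autocovariance matrix of a multivariate series.

    series: list of T observation vectors (each a list of p values).
    Entry (i, j) is (1/T) * sum over t of (y[t+h][i] - mean_i) * (y[t][j] - mean_j).
    """
    T, p = len(series), len(series[0])
    mean = [sum(row[i] for row in series) / T for i in range(p)]
    centered = [[row[i] - mean[i] for i in range(p)] for row in series]
    return [[sum(centered[t + lag][i] * centered[t][j]
                 for t in range(T - lag)) / T
             for j in range(p)] for i in range(p)]

def block_hankel(series, max_lag):
    """Arrange autocovariances so that block (i, j) holds the lag i+j+1 matrix."""
    p = len(series[0])
    # Lags 1 .. 2*max_lag - 1 are needed to fill all max_lag x max_lag blocks.
    gammas = {h: autocovariance(series, h) for h in range(1, 2 * max_lag)}
    H = []
    for i in range(max_lag):
        for row in range(p):
            H.append([gammas[i + j + 1][row][col]
                      for j in range(max_lag) for col in range(p)])
    return H

# Toy series with two observed variables and twelve time points.
series = [[float(t % 3), float((t * t) % 5)] for t in range(12)]
H = block_hankel(series, max_lag=2)  # a 4 x 4 matrix built from lags 1, 2, 3
```

The number of singular values of `H` that are clearly separated from zero would then serve as an indication of the state dimension.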

In the state space model (

Many gene expression (time series) data sets contain tens of thousands of observed genes. For example, the Arabidopsis ATH1 Affymetrix microarray represents more than 24,000 genes. Since the numerical expense of estimating the parameters of a state space model increases quadratically in the number of observed genes, parameter estimation for the complete set of observed genes quickly becomes computationally challenging. To address this challenge without restricting the gene space or limiting the KM-algorithm, one can randomly partition the data into several smaller subsets of approximately equal size. The KM-algorithm is then run repeatedly on each subset, each time with different initial values. When sufficient computing capacity is available, these calculations can be carried out in parallel. The regulation criterion results for genes from all subsets are collected, and the procedure is repeated with a different random partitioning of the data. The results are then averaged to yield a single value for each gene. Biologically, splitting the observed data into subsets has no ill effect if all regulators are unobserved components, such as protein levels. However, if some of the regulators are observed gene expression values themselves, then a single split may miss gene-gene interactions between genes that fall into different subsets. The random partitioning is therefore repeated with different subsets to accommodate possible gene-gene interactions.
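The partition-and-average scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions: `score_subset` is a hypothetical callable standing in for one or more KM-algorithm runs on a subset (it is not part of any published software), and the toy scorer exists only to make the example runnable.

```python
import random
from collections import defaultdict

def partitioned_scores(genes, score_subset, n_subsets, n_repeats, seed=0):
    """Average per-gene scores over repeated random partitions.

    score_subset is a hypothetical stand-in for the KM-algorithm: it takes
    a list of gene ids and returns {gene_id: regulation criterion value}.
    """
    rng = random.Random(seed)
    totals = defaultdict(float)
    for _ in range(n_repeats):
        shuffled = genes[:]
        rng.shuffle(shuffled)
        # Striding yields subsets whose sizes differ by at most one.
        subsets = [shuffled[i::n_subsets] for i in range(n_subsets)]
        for subset in subsets:
            for gene, score in score_subset(subset).items():
                totals[gene] += score
    # Each gene appears in exactly one subset per repeat.
    return {gene: totals[gene] / n_repeats for gene in genes}

def toy_scorer(subset):
    """Placeholder scorer: returns the length of each gene id as its score."""
    return {gene: float(len(gene)) for gene in subset}

genes = ["g%d" % i for i in range(10)]
averaged = partitioned_scores(genes, toy_scorer, n_subsets=3, n_repeats=4)
```

Because the subsets within one repeat are independent of each other, the inner loop over subsets is the natural place to parallelize.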

Data of different sizes (gene numbers

For simulated data where some of the genes are simulated or known as regulated, performance of the KM-algorithm is evaluated by ranking the genes. A perfect ranking result is one in which the regulation criterion values of the regulated genes surpass those of all unregulated genes. Due to both technical and biological variation in the observations this is rarely the case, and therefore an objective measure that describes the “goodness of ranking” of the KM-results is required. The goodness of ranking (GR) measure used here is based on the average ranking positions of the regulated genes. It assigns a value of one to a perfect ranking and a value of zero to the average random ranking of regulated and unregulated genes. Note that negative GR-measure values are possible and will occur if the regulated genes are listed at the bottom of the ranking list.
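The exact GR formula is not reproduced here. One normalization that satisfies the stated properties (one for a perfect ranking, zero for the expected random ranking, negative values when regulated genes sit at the bottom) is sketched below; it should be read as an assumption consistent with the description rather than as the authors' definition.

```python
def goodness_of_ranking(scores, regulated):
    """Assumed GR normalization based on the mean rank of the regulated genes.

    scores: {gene: criterion value}, where higher means more strongly regulated.
    regulated: set of genes known (or simulated) to be regulated; it must be
    non-empty and smaller than the full gene set.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    n, k = len(ranked), len(regulated)
    positions = [pos for pos, gene in enumerate(ranked, start=1)
                 if gene in regulated]
    mean_pos = sum(positions) / k
    best = (k + 1) / 2      # regulated genes occupy positions 1..k
    chance = (n + 1) / 2    # expected mean position under a random ordering
    return (chance - mean_pos) / (chance - best)

scores = {"a": 5.0, "b": 4.0, "c": 3.0, "d": 2.0, "e": 1.0, "f": 0.0}
gr_perfect = goodness_of_ranking(scores, {"a", "b"})   # regulated genes on top
gr_worst = goodness_of_ranking(scores, {"e", "f"})     # regulated genes at bottom
```

Under this normalization the worst possible ranking yields a value of minus one, which matches the remark that negative values occur when the regulated genes are listed at the bottom.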

The KM-algorithm is applied to a simulated data set with

In Figure

Percentages of correctly classified genes in the simulation whose results are depicted in Figure

| Top | Temporal variance: Regulated | Temporal variance: Unregulated | Regulation criterion: Regulated | Regulation criterion: Unregulated |
|---|---|---|---|---|
| 1% | (0/10) 0% | (970/990) 98.0% | (10/10) 100% | (980/990) 99.0% |
| 5% | (2/50) 4.0% | (932/950) 98.1% | (19/50) 38% | (949/950) 99.9% |
| 10% | (7/100) 7% | (887/900) 98.6% | (19/100) 19% | (899/900) 99.9% |
| 20% | (7/200) 3.5% | (787/800) 98.4% | (20/200) 10% | (800/800) 100% |
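The denominators in the table equal the number of selected genes (10, 50, 100, 200) and the number of remaining genes (990, 950, 900, 800). This suggests that the "Regulated" columns count truly regulated genes among the selected top fraction, and the "Unregulated" columns count truly unregulated genes among the rest. That reading is an inference from the numbers, and the following sketch computes both counts under it, using hypothetical toy data.

```python
def top_fraction_summary(scores, regulated, fraction):
    """Select the top `fraction` of genes by score and count correct calls.

    Returns ((regulated genes in the top list, size of the top list),
             (unregulated genes in the remainder, size of the remainder)).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_top = round(fraction * len(ranked))
    top, rest = ranked[:n_top], ranked[n_top:]
    hits = sum(1 for gene in top if gene in regulated)
    correct_rest = sum(1 for gene in rest if gene not in regulated)
    return (hits, len(top)), (correct_rest, len(rest))

# Toy data: 20 genes with strictly decreasing scores; g0 and g1 are regulated.
scores = {"g%d" % i: float(20 - i) for i in range(20)}
regulated = {"g0", "g1"}
top_quarter = top_fraction_summary(scores, regulated, 0.25)
```

The same function can be applied to any per-gene score, so a temporal variance ranking and a regulation criterion ranking can be compared on equal terms.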

Averaged regulation criterion for five applications of the KM-algorithm as applied to simulated data with

A major advantage of the regulation criterion that is applied here is that it is independent of the order of the unobservable regulators. Hence, the results from two or more applications of the KM-algorithm on the same data set with different initial starting values may be averaged to yield higher power in detecting regulated genes.

Figure

Average goodness of ranking (GR) measure (with standard error) for averaging

Figure

A common difficulty in many complex statistical models is selecting the appropriate model dimension. Because of the vast number of genes, microarray applications are especially challenging when selecting an appropriate model. The model selection method presented earlier is based on the autocovariances of the (gene) observations. For nine simulated data sets of different sizes and lengths the block Hankel matrix of estimated autocovariances is computed for maximum biological time lag

Standardized singular values of the block Hankel matrix of autocorrelations for maximum biological time lag

Because the maximum biologically relevant time lag

For nine simulated data sets of different size

As seen in Figure

A more detailed discussion of the effects of misspecification of the maximum relevant time lag

For larger microarray experiments with thousands of genes the proposed partitioning method is demonstrated. The KM-algorithm is implemented five times for each data set that consists of

Table

The “goodness of ranking” (GR) measure results for gene ranking obtained by applying the KM-algorithm to the complete and partitioned data sets with

Complete data set | 0.0421 | 0.1543 | 0.6434 |

Partitioning method | 0.2012 | 0.2982 | 0.5349 |

The partitioning method is applied to data generated by a well-studied yeast cell cycle experiment [

The KM-algorithm in combination with the partitioning method is applied to the original Spellman

Figure

Regulation criterion values obtained by applying the partitioning method to Spellman's CDC15 yeast data plotted against the maximum absolute expression value of each gene. Genes that have been found to be cell cycle regulated by Spellman are plotted as red dots. Four genes that received high regulation criterion values, yet were not found to be cell cycle regulated by the original Spellman analysis, are labeled.

An efficient approach to ranking genes according to their degree of regulation in the observed biological process is presented using the novel KM-algorithm. While the KM-algorithm is implemented here on gene expression data in a microarray setting, it is technology independent. The ranking that results from a KM analysis can be used to select genes for individual study or to fit regulatory networks with existing methods that rely on the preselection of a smaller subset of genes. The selection of genes according to regulation, rather than absolute expression or variation over time, is biologically more meaningful and has great potential to aid in the discovery of regulatory pathways and networks.

The major benefit of using a state space model in the proposed KM-algorithm is the inclusion of hidden regulators. This feature is especially important when the focus is on constructing regulatory networks, since it provides an opportunity to discover additional regulating genes that may not be in the current network. The KM-algorithm is expected to be broadly applicable, since it covers situations in which the regulators of gene expression are known as well as situations in which they are unknown (e.g., transcription factors, DNA methylation, and cell external stimuli). Furthermore, the model provides a convenient way to account for the technical variation, which is an integral part of any technology, separately from the biological variation in the observed organisms, and this separation may aid novel discoveries.

It is not surprising that complex statistical models are required to represent both the complexity and dependence structure of gene regulatory networks. Parameter estimation for complex models with different sources of variation, and simultaneous gene observations on a large number of variables, is one of our greatest challenges. In particular, the estimation of parameters in a Bayesian network, such as a state space model, is an

The KM-algorithm, which ranks genes based upon their degree of regulation, is easy to implement, and the calculations are feasible even for very large microarray data sets. Simulations show that the quality of gene ranking for time series of medium length (

Gene regulatory networks are only one example of a more general biological pathway. Other applications include the study of an organism's metabolome or proteome over time [

This work is partially funded by the NSF Plant Genome Grant 0501712-DBI to RWD.