Nonlinear-Model-Based Analysis Methods for Time-Course Gene Expression Data

Microarray technology has produced a huge body of time-course gene expression data and will continue to produce more. Such gene expression data has been proved useful in genomic disease diagnosis and drug design. The challenge is how to uncover useful information from such data by proper analysis methods such as significance analysis and clustering analysis. Many statistic-based significance analysis methods and distance/correlation-based clustering analysis methods have been applied to time-course expression data. However, these techniques are unable to account for the dynamics of such data. It is the dynamics that characterizes such data and that should be considered in analysis of such data. In this paper, we employ a nonlinear model to analyse time-course gene expression data. We firstly develop an efficient method for estimating the parameters in the nonlinear model. Then we utilize this model to perform the significance analysis of individually differentially expressed genes and clustering analysis of a set of gene expression profiles. The verification with two synthetic datasets shows that our developed significance analysis method and cluster analysis method outperform some existing methods. The application to one real-life biological dataset illustrates that the analysis results of our developed methods are in agreement with the existing results.


Background
To understand the mechanisms of dynamic biological processes, DNA microarray experiments have been employed to produce gene expression profiles at a series of time points, for example, the cell division cycle processes of yeast Saccharomyces cerevisiae [1,2], bacterium Caulobacter crescentus [3], and human being [4]. Such time-course gene expression data provides a dynamic snapshot of most (if not all) of the genes related to the biological development process and thus can be useful in genomic disease diagnosis and genomic drug design. The challenge is how to uncover useful information from such data by proper analysis methods [5].
Although the behaviours of genome-wide genes can be monitored simultaneously with the current DNA microarray technology, not are all of monitored genes closely related to the biological process being studied or of interest. In addition, gene expression data are often contaminated by various noises or noisy genes. It is impossible to uncover some useful information without any preprocessing. Either excluding genes of interest or including noisy genes could degrade the significance of any analysis results. Therefore, it is critical to select the genes which are closely relevant to a biological process from gene expression profiles measured during the biological process. The selection of genes can be performed by the so-called significance analysis of gene expression profiles. Much attention has been paid to the significant analysis of static gene expression data over the past years. For gene expression data obtained from a pair of conditions (e.g., normal versus abnormal, or control versus treatment) with multiple replicates, one of the widely used approaches in early years is called the -fold change method [6,7]. The "fold change" method determines a gene to be significantly expressed if the ratio of expression values under two different conditions is greater than or less than 1/ , where is a userpreset positive number. This approach has been improved by a resampling (bootstrapping) method called SAM [8,9]. Another approach to the significance analysis is the use of 2 The Scientific World Journal -test, for example, on logarithm of the expression levels. In a -test, the means and variances of gene expressions from a pair of conditions are used to compute a normalized distance so-called -value. When the -value exceeds a certain threshold depending on the confidence level selected, gene expression data from a pair of conditions are considered to be significantly different. Although -fold and -test approaches can be extended to apply for the analysis of gene expression data with multiple conditions, for example, SAM [8,9] and RIT [10], these approaches need the assumption that multiconditional values are statistically independent. Therefore, it is not applicable to time-course gene expression profiles as they are not statistically independent but dynamically dependent. In recent year, we have developed several methods for significance analysis of time-course gene expression data. In [11,12], we employ linear regression models to detect the significantly differentially expressed genes. In [13,14], we employ nonlinear models to detect the periodically expressed genes.
Besides the significance analysis, the cluster analysis is another class of analysis methods to uncover the useful information from gene expression data [5]. A number of clustering methods have been proposed for cluster analysis of gene expression data. These include distance/correlation-based clustering methods (e.g., hierarchical clustering [15], -means clustering [16], and self-organizing maps [17]) and staticmodel-based clustering methods [18,19]. In these methods, gene expression profiles are viewed as multidimensional vectors. Distance/correlation-based clustering methods cluster genes based on the distance/correlation among their expression profiles. Static-model-based clustering methods assign genes to one of the clusters if their expression profiles are generated by a multivariate normal distribution. These methods do not take the dynamic of time-course gene expression data into account and thus are not efficient for periodically expressed gene data. Some dynamic-model-based clustering methods have been proposed to analyze time-course gene expression data [20,21]. These methods employ linear autoregressive models to describe the dynamics of time-course gene expression data. Recently we propose the nonlinear-modelbased method for clustering periodically expressed genes [22,23].
As measured from typical nonlinear biological systems, time-course gene expression profiles should display the nonlinear properties. In this paper, we propose nonlinear-modelbased methods for significance analysis and cluster analysis of time-course gene expression data. The proposed nonlinear model can be viewed as a generalization of many existing models [13,14,[20][21][22][23]. A two-step method is proposed to estimate the model parameter. An -test is employed to determine if a gene expression profile is significantly different from noisy data. A relocation-iteration algorithm is employed to assign each gene to an appropriate cluster. A bootstrapping method and an average adjusted Rand index (AARI) are employed to measure the quality of clustering. We employ two synthetic datasets to evaluate the performance of the proposed methods and apply them to one real-life biological dataset.

Nonlinear Model for Time-Course Gene Profiles. Let
( ) ( = 1, 2, . . . , ) be a time-course gene expression profile generated from a dynamic biological process, where is the number of time points at which gene expression is measured. Many nonlinear gene expression profiles contain a periodic component and a long-term decrease or increase component. In this study, we employ the following nonlinear model to describe time-course gene expression data: where parameter represents the degradation rate of periodicity; parameters and are the coefficients of sine and cosine functions, respectively; parameter is the frequency of periodic expression data; parameters and are the coefficients of linear function; and ( ) represents random errors. This study assumes that the errors have a normal distribution independent of time with the mean of 0 and the variance of 2 . This model generalizes several existing models; for example, setting = = = 0, model (1) is reduced to sinusoidal function model [24][25][26][27][28][29][30]: which is widely used to generate the synthetic periodic gene expression profiles [24] and to detect the periodically expressed genes [27][28][29]. In model (2), = √ 2 + 2 is called the magnitude and Φ = arc tan( / ) is called the phase. Setting = 0, model (1) is reduced to a model used in [13], while, setting = = 0, model (1) is reduced to a model used in [14,22]. As model (1) is the generalization of several existing models, it is expected that the analysis results based on this model are better than those reduced models.
To construct model (1) six parameters need to be estimated from a time-course gene expression profile ( ) ( = 1, 2, . . . , ). Obviously estimating those parameters in model (1) is a nonlinear estimation problem as and are nonlinear in the model. In general, all nonlinear optimization programs can be used to estimate parameters in model (1), for example, Gauss-Newton iteration method and its variants such as Box-Kanemasu interpolation method, Levenberg damped least squares methods, and Marquardt's method [31][32][33]. However, these iteration methods are sensitive to initial values. Another main shortcoming is that these methods may converge to the local minimum of the least squares cost function and thus cannot find the true values of the parameters.
Our observation is that noise free model (1) can be viewed as the general solution of the following second order ordinary differential equation: which is independent of and and The Scientific World Journal 3 Now we can see that constant parameters , , , and are linear in (4). As long as we get the first and second derivatives, we can easily estimate the parameters , , , and by the linear least squares method. Then we can get the estimation of , , , and from equations in (5). Finally we can use (3) to estimate the rest of parameters and . Therefore, we propose the following two-step parameter estimation methods to estimate all six parameters in model (1).
Step 1. Numerically calculate the first and second derivatives of ( ). As time-course gene expression data are discrete, the first and second derivatives of ( ) can be estimated by the central (second order) finite difference formulas as follows: respectively, where Δ is the time difference between two consecutive gene expression data points. If the number of data points in a gene expression profile is enough, one can choose a high order finite difference formula to get more accurate estimation of these derivatives.
Then, based on model (4), we use the linear least squares method to estimate parameter 2 . In detail, let ] . (8) From (6) and (7), we have = −2. Then by the least squares method, the paramters , , , and in model (4) can be estimated as Note that if the value of 4̂−̂2 calculated by (5) for a gene is negative, the expression of this gene will be judged not to be described by model (1).
Step 2. Substitute the estimated values of , , , and into (3). Then we apply the least squares method to model (1) by the least squares method, and can be estimated as where

Nonlinear-Model-Based Significance Analysis.
Significance analysis of gene expression data is to determine if a gene expression profile is significantly different from noisy data. This issue is not easy to answer through statistical inference [29,30] yet, especially for time-course gene expression profiles as their data points are not statistically independent. However, a practical way in the literature [27][28][29][30] is to perform a statistical hypothesis test whether the gene expression profile is pure normal white noise or it fits a certain model as specified by (1). Along with this way, this study tests the null hypothesis of versus the alternative hypothesis of ( 1 ) (see (1)).
where 2 0 is the residual of model (13) with estimated parameters, and where 2 1 is the residual of model (1) with estimated parameters. As the noise model (13) can be viewed as a special case of model (1), the statistic follows the -distribution with the degrees of freedom (5, − 6), according to statistics theory [21,23]. When the value of -statistic is large enough (greater than a specified threshold), model (13) is rejected; that is, the gene expression profile is not pure normal white noise, and otherwise the gene expression profile appears as white noises. According to degrees of freedom (which are related to the length of time-course data and the number of parameters in the models) and a significance level (typically, 0.01, 0.05, 0.1, 0.2, or the like) specified by a user, the threshold 4 The Scientific World Journal value can be determined from -distribution table or by using a -distribution table or a standard MATLAB function (" ", 1 − , 5, − 6), where is the significance level. A significance level is the probability that the null hypothesis is true. Therefore, the rejection of the null hypothesis at a smaller significance level indicates the more favourable to alternative hypothesis. That is, the smaller the significance level is, the more confidence one accepts that genes are not noises if its corresponding value of -statistic is greater than the threshold.

Estimation of Cluster
where where | | represents the number of time series in cluster , ∑ =1 | | = .

Algorithm for Clustering.
This study employs the following relocation-iteration algorithm to estimate the parameters such that the cost function (17) is minimized: (1) select an initial partition for the given number of clusters, ; (2) iterate ( = 1, 2, . . .): (a) estimate the parameter Θ based on the current partition by using (18) (3) stop if the improvement of the cost function (17) is below a given threshold, or the cluster memberships of time series do not change significantly.

Evaluation
In this section, we use two synthetic datasets to evaluate our proposed significance analysis method and cluster analysis method, respectively. To evaluate the significance analysis method, we generate one synthetic dataset that consists of 2000 noisy gene expression profiles based on model (13) and 2000 time-course gene expression profiles based on model (1). All 4000 expression profiles are depicted in Figure 1, from which one can hardly differentiate time-course gene expression profiles from noisy ones. To measure the performance of significance analysis, we employ two widely used indices: sensitivity and specificity, which can be defined as follows [34]: The Scientific World Journal The sensitivity and the specificity depend on thresholds which determine if an expression profile is time-course or noisy. In general, the sensitivity is increasing, while the specificity is decreasing and vice versa. However, a good method is expected to have both high sensitivity and specificity. Figure 2 depicts the curves of sensitivity versus specificity over various thresholds. From this figure, we can see that both sensitivity and specificity can be as high as of 99% for a specific threshold, which indicates that our proposed significance analysis methods are excellent.
To evaluate our proposed cluster analysis method, another synthetic dataset consisting of six clusters is generated from model (1), where different clusters have different randomly selected parameters with some large variances. In each cluster, all profiles are generated with model parameters for this cluster with some random perturbations. All generated profiles are plotted in Figure 3, from which one can see that all time-course gene expression profiles are mixed up. To measure the quality of clustering results, we use the adjusted Rand index (ARI) [35], which originally is to measure the degree of agreement between two partitions of the same set of objects. Given two partitions of objects, the -cluster partition = { 1 , . . . , } and the -cluster partition = {V 1 , . . . , V }, the ARI is defined as follows [35]: where is the number of pairs of objects, is the number of objects that are both in clusters and V , = 1, . . . , , = 1, . . . , , and . is the number of objects in cluster , while . is the number of objects in cluster V . From these definitions, we have The expected value of ARI is 1 when two partitions agree perfectly and 0 when they are selected at random. As the results of clustering are sensitive to the initial partition, we run our proposed clustering algorithm and competing clustering algorithms 30 times on the synthetic dataset and calculate the average ARI (AARI) for each algorithm. Figure 4 depicts the AARI of three algorithms named "algorithm with random initial, " "algorithm withmeans initial, " and " -means" over several different numbers of clusters, where "algorithm with random initial" means our proposed clustering algorithm with randomly chosen initial partition, "algorithm with -means initial" means our proposed clustering algorithm with -means result as initial partition, and " -means" is an algorithm coded in the MATLAB software for -means clustering method. Those values of AARI are also listed in Table 1.
From Figure 4 and Table 1, one can see that our algorithm with random chosen initial partitions outperforms the other two algorithms. Particularly, at the correct number of clusters, 6 The Scientific World Journal    the ARRI from our algorithm with random chosen initial partitions reaches its maximum. The quality of our algorithm with -means result as initial partitions is comparable with that of -means, which indicates that after -means falls in a local optimum, our algorithm cannot escape from that local optimum and thus inherits the drawbacks of -means. This also suggests that our developed algorithm should combine with random chosen initial partitions.

Applications to a Real-Life Gene Expression Data
In this section, we apply our proposed significance analysis and cluster analysis method to a real-life gene expression dataset which is collected from the alpha-synchronized experiment [2]. To study the mitotic cell division cycle of yeast, Spellman et al. [2] have monitored more than 6000 genes of yeast (Saccharomyces cerevisiae) at 18 equally spacing time points in the alpha-synchronized experiment. The original dataset is publicly available at http://genome-www.stanford.edu. Genes with missing data are excluded in this study. The resultant dataset contains the expression profiles of 4489 genes. We first apply our proposed significance analysis method to this dataset and set the significance level = 0.1. As a result, 846 genes are determined to be involved in the alphasynchronized cell division cycle process, while the other 3643 genes are determined to be noises with respect to this process. Figure 5(f) depicts these 3643 expression profiles. From Figure 5(f), most of the expression profiles look like noises and are not related to the alpha-synchronized cell division cycle process according to the results in [2]. Then we apply our proposed clustering algorithm to the subset consisting of 846 genes involved in the alpha-synchronized cell division cycle process. According to the biological meaning of this process [2], the reasonable number of clusters is 5. The model parameters identified for each of the five clusters are listed in Table 2. From Table 2, for all clusters the values of parameter are negative numbers, which are reasonable. As the cell division cycle is a stable biological system, after a perturbation such as the alpha synchronization, the system tends to its stable attractor. Therefore the degradation rate represented by should be negative.
Furthermore, the values of model parameters and determine the importance of periodic components. From Table 2, the module of parameters and is the largest, while the absolute value of parameter is small for Cluster 3. This indicates that 17 genes in Cluster 3 are periodically expressed in this process, which can be verified from Figure 5(c). Actually all 17 genes in this cluster have also been identified as periodically expressed genes in [2]. The module of parameters and is the second largest for Cluster 5, while the absolute value of parameter is very large for Cluster 5. As a result, gene expression profiles in Cluster 5 are quickly degrading while hardly displaying periodicity as The Scientific World Journal  The Scientific World Journal shown in Figure 5(e). According to the estimated values of model parameters, expression profiles in other clusters can similarly be explained.

Conclusions
In this paper, we have presented a significance analysis method and a cluster analysis method for time-course gene expression profiles. In these methods, time-course gene expression profiles are modeled by a nonlinear model, which is a generalization of several existing models. To estimate the parameters, which is key to the developed significance analysis method and a cluster analysis method, we propose a two-step linear least squares method. One synthetic dataset has been employed to verify our developed significance analysis method in terms of sensitivity and specificity, while another synthetic dataset has been employed to verify our developed cluster analysis method in terms of AARI. The results have shown that both of our developed methods outperform some existing methods. The application to one real-life biological dataset illustrates that the analysis results of our developed methods are in agreement with the existing results. The reconstruction of gene regulatory network from time-course gene expression data is a very important issue in systems biology [36]. Obviously, noisy genes should be excluded from gene expression data for reconstructing gene regulatory networks. In the future, we may combine our method with other methods as in [36] to reconstruct gene regulatory networks.