Nonlinear Model-Based Method for Clustering Periodically Expressed Genes

Clustering periodically expressed genes from their time-course expression data could help understand the molecular mechanism of those biological processes. In this paper, we propose a nonlinear model-based clustering method for periodically expressed gene profiles. As periodically expressed genes are associated with periodic biological processes, the proposed method naturally assumes that a periodically expressed gene dataset is generated by a number of periodical processes. Each periodical process is modelled by a linear combination of trigonometric sine and cosine functions in time plus a Gaussian noise term. A two stage method is proposed to estimate the model parameter, and a relocation-iteration algorithm is employed to assign each gene to an appropriate cluster. A bootstrapping method and an average adjusted Rand index (AARI) are employed to measure the quality of clustering. One synthetic dataset and two biological datasets were employed to evaluate the performance of the proposed method. The results show that our method allows the better quality clustering than other clustering methods (e.g., k-means) for periodically expressed gene data, and thus it is an effective cluster analysis method for periodically expressed gene data.


BACKGROUND
Many biological processes such as cell-cycle division exhibit periodic behaviors. To understand the mechanisms of these biological processes, DNA microarray experiments have been employed to produce gene expression profiles at a series of time points, for example, the cell division cycle processes of yeast Saccharomyces cerevisiae [1,2], bacterium Caulobacter crescentus [3], and human being [4]. Such timecourse gene expression data provides a dynamic snapshot of most (if not all) of the genes related to the biological development process. It is believed that clustering periodically expressed gene from their timecourse expression data could help understand the molecular mechanisms of those biological processes.
In past decade, a number of methods have been proposed for identifying and clustering periodically expressed genes. The discrete Fourier transform method is the earliest method for identifying and clustering periodically expressed genes [1][2][3][4]. In these papers, the discrete Fourier transform is applied to gene expression data to get a two-dimensional vector. One component of the vector is the sum of all coefficients of sine functions while another component is the sum of all coefficients of cosine functions. Then the magnitude of the two-dimensional vector is used to measure periodicity of time-course gene expression profile. The rather subjective cut-off value is taken to determine if a gene is periodically expressed. By this way, Spellman et al. determine that 800 genes are periodically expressed out of more 6000 gene expression profiles from yeast Saccharomyces cerevisiae. After performing cluster analysis, these 800 genes are divided into five groups [2]. However, microarray experiments typically generate short time-course data. As pointed in [5,6], the frequency resolution obtained on such short time-course data by the discrete Fourier transform is often not adequate for resolving periodicities of interest.
Authors in [7] propose a method called CORRCOS to find periodically expressed genes. CORRCOS generates totally 101000 periodic synthetic models. Each gene expression profile is compared to each of these 101000 models. Although it can identify periodically expressed gene, CORRCOS is too time consuming and the cross-correlation is not real metric. In [6], authors develop another algorithm named RAGE for detecting periodically expressed genes. Like CORRCOS, RAGE is a synthetic model-based method. Compared with CORRCOS, RAGE is less time consuming [6]. Wichert et al. [8] propose a statistical method to identify periodically expressed genes from their time-course gene expression profiles. The method models gene expression profiles also as sine functions use the Fisher g-test for statistical analysis. Given a time-course gene expression profile y t (t = 1, 2, . . . , m), the g-static is defined as where is called the periodogram. It is assumed that if a time-course gene expression profile has a significant sinusoidal component with frequency ω 0 ∈ [0, π], the periodogram exhibits a peak at that frequency with a high probability. On the other hand, if a time-course gene expression profile is purely random, the periodogram reduces to a straight line. Based on Fisher g-test [9], Chen [10] proposes a C&G procedure to identify periodically expressed genes from their time-course expression profiles. The g-statistic is effective only for evenly spaced gene expression profiles. For unevenly spaced gene expression profiles, Chen et al.
propose to use Lomb-Scargle periodograms to discover statistically significant periodic gene expression [11,12]. However, a recent research [13] has concluded that the Fisher g-test is poor if the time-course data is short and/or that data length is not an integer number of periods. Therefore, one can not expect to get a good clustering based on periodically expressed genes identified from these methods.
On the other hand, a number of clustering methods have been proposed for cluster analysis on gene expression data. These include distance/correlation-based clustering methods (e.g., hierarchical clustering [14], k-means clustering [15], and self-organizing maps [16]) and static model-based clustering methods [17,18]. In these methods, gene expression profiles are viewed as multidimensional vectors. Distance/correlation-based clustering methods cluster genes based on the distance/correlation among their expression profiles. Static model-based clustering methods assign genes to one of clusters if their expression profiles may be generated by a multivariate normal distribution. These methods do not take into account the dynamic of time-course gene expression data and thus are not efficient for periodically expressed gene data.
Recently, some dynamic model-based clustering methods have been proposed to analyze time-course gene expression data [19,20]. These methods employ autoregressive models to describe the dynamics of time-course gene expression data. As periodically expressed genes are associated with periodic biological processes, it is natural to model a periodically expressed gene data by periodic (nonlinear) function. This paper proposes a nonlinear model based method for clustering periodically expressed genes from their time-course expression profiles. The proposed method assumes that a periodically expressed gene dataset is generated by a number of periodical processes which are modelled by a linear combination of trigonometric sine and cosine functions in time plus a Gaussian noise term. A two-stage method is proposed to estimate the model parameters, and a relocation-iteration algorithm is employed to assign each gene to an appropriate cluster. A bootstrapping method and an average adjusted Rand index (AARI) are employed to measure the quality of clustering. One synthetic dataset and two biological datasets were employed to evaluate the performance of the proposed method.

Model for Periodically Expressed Gene Profiles
Let x(t) (t = 1, 2, . . . , m) be a time-course gene expression profile generated from a periodical biological process, where m is the number of time points at which gene expression is measured. After shifting the mean of gene expression profiles to 0, the periodicity of this time-course gene expression profile can be modeled by a linear combination of trigonometric sine and cosine functions in time plus a Gaussian noise term as follows [21] x(t) = a cos(ωt) + b sin(ωt) + ε(t), where a and b are the coefficients of sine and cosine function, respectively; ω is the frequency of periodic expression data, and ε(t) represent random errors. This study assumes that the errors have a normal distribution independent of time with the mean of 0 and the variance of σ 2 . This model is equivalent to sinusoidal function model [7,8,[10][11][12][13] which are widely used to generate the synthetic periodic gene expression profiles [7] and to detect the periodically expressed genes [2,8,[10][11][12]. In model (2.2), A = √ a 2 + b 2 is called magnitude and = arctan(a/b) is called the phase.
Given a time-course gene expression profile x(t) (t = 1, 2, . . . , m), estimating parameters a, b, and ω in model (2.1) is a nonlinear estimation problem as ω is nonlinear in the model. In general, all nonlinear optimization programs can be used to estimate parameters in model (2.1), for example, Gauss-Newton iteration method and its variants such as Box-Kanemasu interpolation method, Levenberg damped least squares methods, and Marquardt's method [22]. However, these iteration methods are sensitive to initial values. Another main shortcoming is that these methods may converge to the local minimum of the least squares cost function and thus cannot find the true values of the parameters.
Our observation is that noise-free model (2.1) can be viewed as the general solution of a following second-order ordinary differential equation and that ω 2 is linear in equation (2.4) which is independent of a and b. Therefore, we propose the following two-step parameter estimation methods to estimate parameters a, b, and ω in model (2.2).
Step 1. Numerically calculate the second derivative of x(t). Then, based on equation (2.4), use linear least squares method to estimate parameter ω 2 . In details, let then, by the least squares method, ω 2 is estimated as as time-course gene expression data are discrete, the second derivativeẍ(t) is estimated by the central finite difference formula as follows: where is time difference between two consecutive gene expression data points. From (2.7), the length of vectors X 2 and X 1 is m − 2. Note that if the value of ω 2 calculated by (2.6) for a gene is negative, this gene will be judged not to be periodically expressed.
Step 2. Substitute the estimated value of ω into (2.2). Apply the maximum likelihood method to model (2.1) to estimate parameters a and b. In detail, let

The Mixture Model
In this study, it is assumed that a time-course gene expression dataset is a collection of periodically expressed gene profiles which belong to several clusters, and profiles in each cluster can be described by model (2.1) or (2.2) with different parameters. Let θ k = [a k , b k , ω k , σ 2 k ] be parameters of model (2.1) for the kth cluster. Then the task of nonlinear model-based clustering is as follows: for a given number of cluster K , divide a time-course gene expression dataset into a partition C = {C 1 , . . . ,C k , . . . ,C K } using model where the parameters consist of {θ k , k = 1, . . . , K }.

Estimation of Model Parameters
According to the parameter estimation method proposed in previous section for a single time-course expression profile, for the kth cluster parameters, θ k = [a k , b k , ω k , σ 2 k ] can be estimated as where |C k | represents the number of time series in cluster C k , K k=1 |C k | = N .

Algorithm
This study employs a relocation-iteration algorithm as shown in Algorithm 1 to estimate the parameters such that the cost function (2.10) is minimized. In 2(a) of Algorithm 1, t represents the estimated parameters in cost function (2.10) at iteration t while, in 2(b), parameters a t k , b t k , and ω t k represent the parameters of model k at iteration t.

Evaluation
In this study, we use the adjusted Rand index (ARI) [23] to evaluate the quality of the clustering. Consider two partitions of N objects: the r -cluster partition U = {u 1 , . . . u r } and the s-cluster partition V = {v 1 , . . . , v s }. One may construct a contingency table (matrix) as in Table 1.
In Table 1, entry n i j is the number of objects that are both in clusters u i and v j , i = 1, . . . ,r , j = 1, . . . , s. Let n i. = s j=1 n i j and n . j = r i=1 n i j denote the sum of row i (i = 1, . . . ,r ) and the (1) Select an initial partition for given the number of clusters, K (2) Iteration (t = 1, 2,...): (a) estimate the parameter t based on the present partition by using (2.11); (b) generate a new partition by assigning each sequence x to cluster k for which the value of Stop if the improvement of the cost function (2.10) is below a given threshold, the cluster memberships of time series do not change.
Algorithm 1: Algorithm for nonlinear model-based clustering. Total n .1 n .2 · · · n .s n .. = n sum of column j ( j = 1, . . . , s) in the contingency matrix, respectively, and let V = N (N − 1)/2 (the number of pairs of N objects). Based on the contingency matrix of two partitions, the ARI is defined as (2.12) The expected value of ARI is 1 when they matched perfect and 0 when the two partitions are selected at random. If the true cluster labels for some dataset are known, the proposed clustering methods can be applied these datasets to obtain new cluster labels. Then ARI can be calculated for these two partitions. If ARI is close to 1, one can say that the proposed clustering method is in agreement with the true clusters. However, for real-life gene expression datasets, the true cluster labels are typically unknown. For this case, this study adopts a bootstrapping approach as shown in Algorithm 2 [20] to evaluate the proposed clustering methods. For the given number of clusters, K , the average ARI (AARI) reports the quality of the clustering result obtained from the evaluated clustering methods. Accordingly, the larger AARI, the better the quality of the clustering is, that is, the better the performance of the clustering method is.

EXPERIMENTAL RESULTS AND DISCUSSION
This study employs a synthetic dataset and two biological datasets to investigate the performance of the proposed method in different aspects.
(1) Repeat the following B times (where B is a preset integer number).
(a) Randomly divide the original dataset into two nonoverlapping sets, a learning set L, and a test set T .
(b) Apply the evaluated method to the learning set L to obtain a partition P (•, L).
(c) Construct a predictor (classifier) C(•, L) using the cluster labels from the partition P(•, L).
(d) Apply the predictor C(•, L) to the test set T to get the predicted partition P (•, T ).
(e) Apply the evaluated method to the test set T to obtain a partition P(•, T ).
(f) Calculate the ARI of partitions P(•, T ) and P(•, T ). (2) Calculate the average ARI (AARI) over the B times as the measure index of the proposed clustering method.
(3) For the various number of clusters, K , repeat the procedure described in steps (1) and (2) above to get AARI(K ), and then plot AARI(K ) with respect to K .

Synthetic Dataset (SYN)
The synthetic dataset is generated by model (2.1). Let x it be the simulated expression (log-ratio) values of gene i at time point t in the dataset, that is, where n is the number of genes, m is the number of time points, and K is the number of clusters. In this study, parameters for synthetic data a k , b k , and w k are randomly chosen as follows: where n k is the number genes in the kth cluster. The resulted parameters for synthetic data are shown in Table 2.
For various numbers of clusters, we run the proposed method described in Algorithm 1 with randomly chosen initial partitions, with the initial partitions from k-means results as and to the k-means 3    methods. The ARI between clustering results and the known true cluster labels is calculated. The values of AARI are calculated over 20 runs and shown in the Table 3 and Figure 1. From Figure 1, the proposed method with both initial partitions randomly chosen and those from k-means results has greater value of AARI than k-means when the number of clusters is greater than 3. Furthermore, when the number of clusters is the true value of 5, the AARI of the proposed method with both initial partitions reaches its maximum, which makes sense. However, the AARI of k-means method did not reach its maximum when the number of clusters is 5. Therefore, we can conclude that the proposed method outperforms the k-means in terms of AARI.

Real-Life Datasets
In this study, two real-life datasets are employed to illustrate the proposed method: ELU and BAC. ELU consist of expression profiles of 4304 genes without missing data. Expression profiles are obtained from yeast cell cycle division process through Eluration-synchronized experiments conducted by Spellman et al. [2]. Each expression profile has 14 equally spacing time points. BAC consists of expression profiles of 1590  genes without missing data. Expression profiles are measured during the cell cycle division process of the bacterium Caulobacter crescentus [3]. The measurements were taken at 11 equally spaced time points over 150 minutes. Both datasets are preprocessed in the following two steps.
Step 1. Shift the mean of each gene expression profile to 0.
Step 2. Filter the dataset with F-test at the significance level α, that is, where R H i is the sum of squared errors under the specific hypothesis and m is the number of time points.
Keep the genes which reject the null hypothesis (show periodical behaviours) [21].
After these two steps, the number of genes remains for different significant level as in Table 4. Then we run the evaluation procedure proposed in Algorithm 2 on these selected gene expression profiles. The AARIs of the proposed method and k-means over various numbers of clusters are plotted in Figures 2 and 3 for dataset ELU and BAC, respectively. From Figures 2 and 3 the results from both real-life datasets show that the proposed method outperforms the k-means in terms of AARI.

CONCLUSIONS
This paper has presented a nonlinear model-based method for clustering periodically expressed genes from their time-course expression profiles. In this method, profiles of periodically expressed genes and thus the cluster of profiles are modelled by a linear combination of trigonometric sine and cosine functions in time plus a Gaussian noise term which is equivalent to a sinusoidal function model [1-4, 6-13, 17-19]. Although this model is not new, the existing methods are not based on parameter estimation technique, especially not estimating the frequency in the model as it is nonlinear in parameter. In the presented method, a two step linear least squares method is proposed to estimate all model parameters including the frequency for each clusters. Computational experiments on one synthetic dataset and two biological datasets show that the proposed method outperforms the traditional clustering methods such as k-means in terms of AARI, which indicate that the proposed method can effectively cluster periodically expressed genes from their time-course expression profiles.