
Identification of rhythmic gene expression, from metabolic cycles to circadian rhythms, is crucial for understanding the gene regulatory networks and functions of these biological processes. Recently, two algorithms, JTK_CYCLE and ARSER, have been developed to estimate the periodicity of rhythmic gene expression. JTK_CYCLE performs well for long or less noisy time series, while ARSER performs well for detecting a single rhythmic category. However, observing gene expression at high temporal resolution is not always feasible, and many scientists are interested in exploring ultradian and circadian rhythmic categories simultaneously. In this paper, a new algorithm, named autoregressive Bayesian spectral regression (ABSR), is proposed. It estimates the period of time-course experimental data and classifies gene expression profiles into multiple rhythmic categories simultaneously. Simulation studies show that, for data with low temporal resolution, ABSR substantially improves the accuracy of periodicity estimation and of clustering into rhythmic categories compared to JTK_CYCLE and ARSER. Moreover, ABSR is insensitive to the shape of the rhythmic pattern. The new scheme is applied to existing time-course mouse liver data to estimate the period of rhythms and classify the genes into ultradian, circadian, and arrhythmic categories. It is observed that 49.2% of the circadian profiles detected by JTK_CYCLE with 1-hour resolution are also detected by ABSR with only 4-hour resolution.

Organisms from cyanobacteria to humans have robust time-keeping mechanisms called biological clocks [

Circadian rhythms coordinate temporal regulation of other cellular processes. For example, the circadian clock regulates transcriptional activation of Wee1, a critical component in the cell cycle that coordinates timing of cell division [

A series of gene expression levels observed at a set of different time points is called a gene expression profile, and a rhythmic gene produces a rhythmic profile. In general, it is assumed that a rhythmic gene expression profile is correlated with rhythmic periodicity and hence each gene expression takes the form of a series of cosine curves:

In this paper, a new algorithm called autoregressive Bayesian spectral regression (ABSR) is proposed. Built on ARSER, ABSR significantly improves the true discovery rate (TDR) and reduces the false discovery rate (FDR) for noisy short time series compared to JTK_CYCLE and ARSER. A key feature of ABSR is its use of posterior probabilities for model selection rather than the Akaike Information Criterion (AIC). In situations where the number of model parameters is large relative to the number of observations (e.g., the number of parameters is about one-half of the number of observations), AIC may fail to select the optimal model [

In Section

The proposed algorithm, the autoregressive Bayesian spectral regression (ABSR), is developed to identify rhythmic patterns in gene expression profiles. The procedure to obtain periodic information from time-course gene expression data is described below.

Suppose

Next, let the six sets of frequencies obtained by fitting the AR models of the

Lastly, each gene is classified according to the criteria described in Section

Flowchart of ABSR algorithm.

Model selection in ABSR proceeds by estimating the posterior probability of each harmonic model and then selecting the model with the largest posterior probability as the optimal model. To calculate a posterior probability, model (

In the absence of any reason to prefer one model over the others, it is reasonable to assume equal prior probability for each model; namely,

This integral cannot be simplified further but can be estimated by Monte Carlo method. The steps of model selection procedure are as follows:

Simulate each variance parameter according to its prior distribution.

Calculate the value of the likelihood function

Repeat steps (

Repeat steps (
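The Monte Carlo steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the inverse-gamma prior on the noise variance, the g-prior used to integrate out the regression coefficients analytically, and the names `design`, `log_marginal`, and `posterior_probabilities` are all assumptions made for illustration. The sketch draws the variance from its prior, evaluates the likelihood at each draw, averages the results, and repeats this for each candidate harmonic model under equal prior model probabilities.

```python
import numpy as np

def design(t, periods):
    """Design matrix: intercept plus a cosine/sine pair per candidate period."""
    t = np.asarray(t, dtype=float)
    cols = [np.ones_like(t)]
    for p in periods:
        cols.append(np.cos(2 * np.pi * t / p))
        cols.append(np.sin(2 * np.pi * t / p))
    return np.column_stack(cols)

def log_marginal(y, X, g=None, n_draws=5000, seed=0):
    """Monte Carlo estimate of the log marginal likelihood of one model."""
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    g = float(n) if g is None else g  # assumed g-prior scale (hypothetical choice)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    # Coefficients are integrated out analytically under the assumed g-prior.
    quad = (y @ y + g * rss) / (1.0 + g)
    rng = np.random.default_rng(seed)
    # Step 1: draw the noise variance from its assumed Inverse-Gamma(2, 1) prior.
    sigma2 = 1.0 / rng.gamma(shape=2.0, scale=1.0, size=n_draws)
    # Step 2: evaluate the log-likelihood at every draw.
    loglik = (-0.5 * n * np.log(2 * np.pi * sigma2)
              - 0.5 * k * np.log1p(g)
              - quad / (2.0 * sigma2))
    # Step 3: average the likelihoods (log-mean-exp for numerical stability).
    m = loglik.max()
    return m + np.log(np.mean(np.exp(loglik - m)))

def posterior_probabilities(y, t, candidate_models, **kwargs):
    """Posterior model probabilities under equal prior model probabilities."""
    logm = np.array([log_marginal(y, design(t, p), **kwargs)
                     for p in candidate_models])
    w = np.exp(logm - logm.max())
    return w / w.sum()

# Usage: a 24-hour cosine sampled every 4 hours for 48 hours.
t = np.arange(0, 48, 4.0)
y = 3 * np.cos(2 * np.pi * t / 24) + 0.3 * np.random.default_rng(1).standard_normal(t.size)
models = [(), (24.0,), (12.0,), (24.0, 12.0)]
probs = posterior_probabilities(y, t, models)
best = models[int(np.argmax(probs))]
```

Because the coefficient prior penalizes each extra harmonic, the model with the largest posterior probability is not automatically the largest model, which is the property that motivates posterior-based selection over a plain goodness-of-fit comparison.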

Given the optimal model, as determined by maximizing the posterior probability, the following values can be calculated: the highest peak of the spectral densities, the

If the maximum value of the spectral densities is less than a preselected spectrum threshold (e.g., 10 or 5) or the dominant period is not significant

Otherwise, the gene is classified by the estimate of the dominant period. User-defined intervals for ultradian and circadian categories are used to classify the profiles. In particular, the rhythmic categories are defined as follows: if the estimated period is greater than or equal to 6 hours and strictly less than 10 hours, denoted by the time interval of

The value of the spectrum threshold must be selected by the user. Counting the rhythmic profiles obtained at each candidate threshold (e.g., from 10 down to 0 in steps of 0.5) shows how the number of rhythmic profiles depends on the threshold, and the threshold can then be chosen according to prior knowledge and the research purpose. For example, if the research goal is to discover as many rhythmic genes as possible, the threshold yielding the maximum number of rhythmic profiles can be selected. For a less conservative result, the threshold can be set to the largest value below which the number of rhythmic profiles no longer changes significantly as the threshold decreases. If the data can be assumed to be less noisy, or if conservative detection of rhythmic profiles is of interest, a large threshold can be applied. For example, a threshold of 10 is used in the simulation studies.
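The classification rule and the threshold scan can be sketched as follows. Only the [6, 10) interval for the 8-hour ultradian category and the example thresholds come from the text; the [10, 14) and [20, 28) intervals, the 0.05 significance level, the treatment of periods outside every interval as arrhythmic, and the function names are assumptions made for illustration.

```python
def classify(period, peak, p_value, threshold=10.0, alpha=0.05):
    """Assign a rhythmic category from the dominant period and spectral peak."""
    # Weak spectral peak or non-significant period: arrhythmic.
    if peak < threshold or p_value >= alpha:
        return "arrhythmic"
    if 6 <= period < 10:       # interval stated in the text
        return "ultradian8"
    if 10 <= period < 14:      # assumed interval
        return "ultradian12"
    if 20 <= period < 28:      # assumed interval
        return "circadian"
    return "arrhythmic"        # assumed handling of all other periods

def count_rhythmic(profiles, thresholds):
    """Number of non-arrhythmic profiles at each candidate spectrum threshold."""
    return {
        th: sum(classify(p, s, pv, threshold=th) != "arrhythmic"
                for p, s, pv in profiles)
        for th in thresholds
    }

# Usage with hypothetical (period, spectral peak, p-value) triples.
profiles = [(8.2, 12.0, 0.01), (24.1, 3.0, 0.02), (11.7, 6.0, 0.20)]
thresholds = [10 - 0.5 * i for i in range(21)]  # 10, 9.5, ..., 0
counts = count_rhythmic(profiles, thresholds)
```

Inspecting `counts` across thresholds gives the correspondence between the threshold and the number of rhythmic profiles described above.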

To assess the performance of the ABSR algorithm, sinusoidal profiles with a length of 48 hours are generated at 4-, 2-, or 1-hour resolution. Four periodic behaviors are considered: periods of 8 and 12 hours (ultradian rhythms), a period of 24 hours (circadian rhythm), and aperiodic (arrhythmic profiles). Because gene expression profiles with a linear trend are common in experimental data, cosine patterns both with and without a linear trend (Table

Formula used to simulate data.

Pattern | Function |
---|---|
Noise | |
Cosine | |
Noise with linear trend | |
Cosine with linear trend | |
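The four patterns in the table can be generated with a short Python sketch. The exact formulas are not reproduced here, so the parameterization below (phase expressed in hours, sampling at 0, 4, ..., 44 hours, additive Gaussian noise with a configurable standard deviation) and the name `simulate_profile` are assumptions for illustration only.

```python
import numpy as np

def simulate_profile(period=None, amplitude=1.0, phase=0.0, slope=0.0,
                     noise_sd=1.0, duration=48, resolution=4, seed=None):
    """Simulate one profile: optional cosine + optional linear trend + noise."""
    rng = np.random.default_rng(seed)
    # Assumed sampling grid: 0, resolution, ..., strictly below duration.
    t = np.arange(0, duration, resolution, dtype=float)
    y = slope * t + noise_sd * rng.standard_normal(t.size)
    if period is not None:
        # period=None gives the "noise" patterns; otherwise add the cosine.
        y = y + amplitude * np.cos(2 * np.pi * (t - phase) / period)
    return t, y

# Usage: the four patterns at 4-hour resolution.
t, noise = simulate_profile(seed=0)
_, cosine = simulate_profile(period=24, amplitude=3.0, seed=0)
_, noise_trend = simulate_profile(slope=0.1, seed=0)
_, cosine_trend = simulate_profile(period=24, amplitude=3.0, slope=0.1, seed=0)
```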

In order to describe the performance of the three algorithms, the following five terms are defined. A

The two ultradian datasets (with true period = 8 and 12) are denoted by ultradian8 and ultradian12, respectively, and the combined dataset of the 4,000 profiles from the four categories (arrhythmic, ultradian8, ultradian12, and circadian) is used to calculate TDRs and FDRs. To make the comparison fair, a period window of 6 to 28 hours (8 to 28 hours for 4-hour resolution) is used for JTK_CYCLE, and period windows of 6 to 14 hours and 20 to 28 hours are used for ARSER. Since ARSER and JTK_CYCLE estimate the period of gene expression but do not classify genes into rhythmic categories, their classifications are derived from the period estimates and their significance. A profile is considered as ultradian8 if its period estimate is significant (

Table

Classification comparisons for fixed period data.

Resol. | Result category | True Arrhy. | True Ultra.8 | True Ultra.12 | True Circa. | TDR (%) | FDR (%) |
---|---|---|---|---|---|---|---|
| | | | | | | | |
4 hr | Arrhy. | 818 | 1 | 75 | 29 | n/a | n/a |
4 hr | Ultra.8 | 72 | 999 | 0 | 0 | 99.9 | 6.7 |
4 hr | Ultra.12 | 76 | 0 | 925 | 0 | 92.5 | 7.6 |
4 hr | Circa. | 34 | 0 | 0 | 971 | 97.1 | 3.4 |
| | | | | | | | |
4 hr | Arrhy. | 0 | 0 | 0 | 0 | n/a | n/a |
4 hr | Ultra.8 | 236 | 493 | 0 | 0 | 49.3 | 32.4 |
4 hr | Ultra.12 | 226 | 0 | 760 | 0 | 76.0 | 22.9 |
4 hr | Circa. | 101 | 0 | 0 | 592 | 59.2 | 14.6 |
4 hr | Undef. | 437 | 507 | 240 | 408 | n/a | n/a |
| | | | | | | | |
4 hr | Arrhy. | 998 | 964 | 943 | 461 | n/a | n/a |
4 hr | Ultra.8 | 0 | 36 | 0 | 0 | 3.6 | 0.0 |
4 hr | Ultra.12 | 0 | 0 | 57 | 0 | 5.7 | 0.0 |
4 hr | Circa. | 2 | 0 | 0 | **539** | 53.9 | 0.4 |
| | | | | | | | |
2 hr | Arrhy. | 993 | 0 | 1 | 0 | n/a | n/a |
2 hr | Ultra.8 | 6 | 1000 | 0 | 0 | 100.0 | 0.6 |
2 hr | Ultra.12 | 0 | 0 | 999 | 0 | 99.9 | 0.0 |
2 hr | Circa. | 1 | 0 | 0 | 1000 | 100.0 | 0.1 |
| | | | | | | | |
2 hr | Arrhy. | 67 | 0 | 0 | 0 | n/a | n/a |
2 hr | Ultra.8 | 241 | 487 | 0 | 0 | 48.7 | 33.1 |
2 hr | Ultra.12 | 71 | 0 | 413 | 0 | 41.3 | 14.7 |
2 hr | Circa. | 80 | 0 | 0 | 383 | 38.3 | 17.3 |
2 hr | Undef. | 541 | 513 | 587 | 617 | n/a | n/a |
| | | | | | | | |
2 hr | Arrhy. | 993 | 0 | 0 | 0 | n/a | n/a |
2 hr | Ultra.8 | 2 | 1000 | 0 | 0 | 100.0 | 0.2 |
2 hr | Ultra.12 | 3 | 0 | 1000 | 0 | 100.0 | 0.3 |
2 hr | Circa. | 2 | 0 | 0 | 1000 | 100.0 | 0.2 |
| | | | | | | | |
1 hr | Arrhy. | 1000 | 0 | 0 | 0 | n/a | n/a |
1 hr | Ultra.8 | 0 | 1000 | 0 | 0 | 100.0 | 0.0 |
1 hr | Ultra.12 | 0 | 0 | 1000 | 0 | 100.0 | 0.0 |
1 hr | Circa. | 0 | 0 | 0 | 1000 | 100.0 | 0.0 |
| | | | | | | | |
1 hr | Arrhy. | 201 | 0 | 0 | 0 | n/a | n/a |
1 hr | Ultra.8 | 139 | 496 | 0 | 0 | 49.6 | 21.9 |
1 hr | Ultra.12 | 115 | 0 | 572 | 0 | 57.2 | 16.7 |
1 hr | Circa. | 80 | 0 | 0 | 776 | 77.6 | 9.3 |
1 hr | Undef. | 465 | 504 | 428 | 224 | n/a | n/a |
| | | | | | | | |
1 hr | Arrhy. | 996 | 0 | 0 | 0 | n/a | n/a |
1 hr | Ultra.8 | 1 | 1000 | 0 | 0 | 100.0 | 0.1 |
1 hr | Ultra.12 | 1 | 0 | 1000 | 0 | 100.0 | 0.1 |
1 hr | Circa. | 2 | 0 | 0 | 1000 | 100.0 | 0.2 |
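The TDR and FDR columns can be reproduced from the counts with a short calculation. The sketch below assumes that TDR is the number of correctly classified profiles divided by the number of profiles truly in that category, and that FDR is the number of misclassified profiles divided by the number of profiles assigned to that category. These definitions are inferred from the tabulated values, and the function name `tdr_fdr` is hypothetical.

```python
def tdr_fdr(confusion, categories):
    """Return {category: (TDR %, FDR %)} from confusion[result][true] counts."""
    rates = {}
    for c in categories:
        # Profiles whose true category is c, over all result rows.
        true_total = sum(row.get(c, 0) for row in confusion.values())
        # Profiles assigned to c, over all true categories.
        assigned = sum(confusion[c].values())
        correct = confusion[c].get(c, 0)
        tdr = 100.0 * correct / true_total if true_total else float("nan")
        fdr = 100.0 * (assigned - correct) / assigned if assigned else float("nan")
        rates[c] = (round(tdr, 1), round(fdr, 1))
    return rates

# Counts from the first 4-hour block of the table (rows: result, keys: true).
block = {
    "arrhy":   {"arrhy": 818, "ultra8": 1,   "ultra12": 75,  "circa": 29},
    "ultra8":  {"arrhy": 72,  "ultra8": 999, "ultra12": 0,   "circa": 0},
    "ultra12": {"arrhy": 76,  "ultra8": 0,   "ultra12": 925, "circa": 0},
    "circa":   {"arrhy": 34,  "ultra8": 0,   "ultra12": 0,   "circa": 971},
}
rates = tdr_fdr(block, ["ultra8", "ultra12", "circa"])
```

For this block the function returns 99.9 and 6.7 for ultradian8, 92.5 and 7.6 for ultradian12, and 97.1 and 3.4 for circadian, matching the table.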

Among the 539 (bold in Table

In addition to the TDRs and FDRs of periods, Figure

Boxplots of period and amplitude estimates for data with different temporal resolutions. Panels: (a) period, 4 hr; (b) amplitude, 4 hr; (c) period, 2 hr; (d) amplitude, 2 hr; (e) period, 1 hr; (f) amplitude, 1 hr. Three outliers (one in each rhythmic category by ABSR) are excluded from (a). One arrhythmic profile by ABSR is excluded from (c). Forty-five arrhythmic profiles by ABSR are excluded from (e). Those outliers represent infinitely large period estimates, which imply arrhythmic property.

Although the period estimates by JTK_CYCLE with 4-hour resolution are less biased, the majority of them are not statistically significant. ABSR, on the other hand, yields significant period estimates for more than 90% of the rhythmic profiles, with a bias of at most 0.55. Note that the circle above the JTK_CYCLE box for the ultradian12 profiles represents 63 profiles, while the bunch of circles above and below the ABSR box represents 36 profiles. The standard error by ABSR is slightly greater than that by JTK_CYCLE (2.64 versus 2.08). Because ARSER provides period estimates in diverse windows for a large portion of the data, its standard errors are much larger than those of ABSR and JTK_CYCLE across rhythmic categories and resolutions.

Considering the amplitude estimate, ABSR performs better with less bias and smaller standard error than ARSER and JTK_CYCLE for all categories and temporal resolutions.

Figure

ROC plots for data with 4-hour resolution. Panels: (a) ultradian8; (b) ultradian12; (c) circadian.

In the above-mentioned simulation study, three fixed values of period, amplitude, phase, and signal/noise ratio are considered. To assess the performance of ABSR under more flexible parameter settings, two more simulation studies are performed. Since both ABSR and JTK_CYCLE provide a single periodicity estimate per profile, only these two algorithms are compared. In the first simulation, 1000 extra profiles are generated with uniformly distributed periods, amplitudes, and phases. Periods are within 6 to 26 hours, amplitudes are within 1 to 6, and phases are within 0 to the corresponding period. Again the profiles are simulated for a 48-hour course with 4-hour resolution, and standard normal errors are added to the sinusoidal waves. ABSR considers all positive values for the period estimate, but very large estimates are not of interest. Hence 58 profiles with very large period estimates (>35 hours) are removed before the period and amplitude estimates are compared with those of JTK_CYCLE. By providing continuous period estimates, ABSR shows stronger linear correlation than JTK_CYCLE for both period and amplitude estimates (Figure

Period and amplitude estimates for randomized period. Panels: (a) period; (b) amplitude. Fifty-eight outliers with huge period estimates by ABSR are excluded from (a). The green reference line has a slope of 1.

Besides period and amplitude estimates, phase information is also an important aspect of rhythmicity. To show the performance of the phase estimate clearly, 500 ultradian profiles and 500 circadian profiles are simulated in the second study. The profiles are generated with a sinusoidal pattern and uniformly distributed parameters: period from 8 to 12 for ultradian profiles and from 22 to 26 for circadian profiles, amplitude from 1 to 6, linear slope from −0.1 to 0.1, and phase from 0 to the length of the cycle. Standard normal error is added to each profile. When the true phase is close to zero or to the true period, both ABSR and JTK_CYCLE sometimes show a noticeable bias in the phase estimate, which may be caused by the low temporal resolution. After removing those profiles, the correlation coefficients are similar for ABSR and JTK_CYCLE on circadian profiles but much higher for ABSR than for JTK_CYCLE on ultradian profiles (Figure

Phase estimate for randomized rhythmicity. Panels: (a) ultradian; (b) circadian.

The settings in the first study test a wide range of periods and amplitude-to-noise ratios, and the settings in the second study test a wide range of phases. ABSR performs well in both studies, so it can be used in diverse situations.

Although a cosine wave is typically assumed, some experimental data exhibit nonsinusoidal patterns, so a good method should be able to detect such rhythms as well. The performance of ABSR in detecting nonsinusoidal circadian rhythms, when both ultradian and circadian rhythms are of interest, is therefore assessed. Five different circadian (period = 24) patterns [

Detecting circadian rhythms for nonsinusoidal patterns. Panels: (a) patterns; (b) number of detected circadian profiles.

From the above-mentioned simulation studies, it is found that, at low resolution (4-hour), ABSR performs best among the three algorithms: it is highly sensitive in detecting rhythmic profiles while keeping the FDR low, and it produces period, amplitude, and phase estimates close to the true values regardless of the temporal resolution. ABSR is capable of discovering harmonic ultradian and circadian profiles simultaneously, and its performance is not affected by the proportion of profiles with a linear trend. As the temporal resolution increases, both ABSR and JTK_CYCLE improve with respect to FDR and TDR, but JTK_CYCLE benefits more from high temporal resolution.

Hughes et al. [

Spectrum thresholds from 0 to 10 with increment of 0.5 are considered, and since the goal is to discover as many rhythmic genes as possible, the threshold of 2.5 is selected. Figure

Classification of rhythmic categories for the mouse liver data with 4-hour resolution. Panels: (a) ultradian8; (b) ultradian12; (c) circadian.

In addition, the three algorithms are applied to the original data, and the spectrum threshold of 1 is selected. Figure

Classification of rhythmic categories for the mouse liver data with 1-hour resolution. Panels: (a) ultradian8; (b) ultradian12; (c) circadian.

Comparison for discovered circadian profiles between ABSR with 4-hour resolution and JTK_CYCLE with 1-hour resolution.

To further understand the result, the linear trend in each profile for both temporal resolutions is examined. Figure

Histogram of linear slope for mouse liver data with different temporal resolutions. Panels: (a) 4-hour; (b) 1-hour.

In this paper, we present a new algorithm, ABSR, to determine the rhythmicity of gene expression profiles from short time series. For noisy short time series (e.g., 48-hour profiles with 4-hour resolution), ABSR performs well in estimating period and amplitude; it substantially reduces the FDR relative to ARSER and increases the TDR relative to both ARSER and JTK_CYCLE. JTK_CYCLE requires a user-defined period window, and different windows may yield inconsistent estimates. ABSR has no such constraint, and its estimates remain consistent even when the sampling is sparse relative to the true period. Moreover, producing a single period estimate without a preset window enables ABSR to discover any harmonic and circadian rhythms simultaneously. Because ABSR treats the linear trend and unwanted noise in the data, it can be applied with less concern for data quality. Like ARSER, from which it inherits, ABSR is a joint strategy that analyzes data in both the frequency and time domains. Although longer experiments with higher resolution may help in studying rhythms, they are not always affordable or feasible. Because of cost, most time-course experiments designed to study rhythms run for 48 hours with 4-hour resolution. In this setting, ABSR is a better choice, and, with the tunable thresholds, the trade-off can be small.

Since ABSR assumes continuous values for the period estimate, it can estimate any rhythms, not limited to ultradian or circadian rhythms. Estimating the period is the first step, and classification is the second step. If one is only interested in the first step, the classification step can be skipped.

In this study, the longest period in consideration is 24 hours, and the temporal resolution is focused on the typical 4-hour resolution, so the

The value of threshold for the spectral density may affect the classification results, so the choice of threshold is crucial. As a consequence of choosing a large threshold, the results could be conservative. In other words, some rhythmic profiles might not be detected, while the detected rhythmic profiles could be accepted with more confidence.

Since ABSR is a Bayesian algorithm, computing time is inevitably a concern. The likelihood functions are estimated independently across profiles, so the data can be partitioned and the algorithm run in parallel to increase computational efficiency. Our tests used a workstation with two Intel Xeon E5-2687W processors (3.10 GHz), 256 GB RAM, Windows 7 Ultimate, and R version 3.1.2. Computational efficiency is tested on 48-hour time-course data with 4-, 2-, and 1-hour temporal resolutions. With 30 threads running in parallel, each thread analyzes 3 to 4 profiles per minute for the 4-hour resolution data, 2 to 3 profiles per minute for the 2-hour resolution data, and about 2 profiles per minute for the 1-hour resolution data.
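As a rough planning aid, the per-thread rates above can be turned into a wall-clock estimate. The sketch below is only arithmetic under the assumption that profiles are split into near-equal contiguous chunks, one per thread, and that every thread sustains the quoted rate; the function names and the dataset size in the usage lines are hypothetical.

```python
import math

def partition(n_profiles, n_workers):
    """Split profile indices 0..n_profiles-1 into near-equal contiguous chunks."""
    base, extra = divmod(n_profiles, n_workers)
    chunks, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)  # first `extra` chunks get one more
        chunks.append(range(start, start + size))
        start += size
    return chunks

def wall_clock_minutes(n_profiles, rate_per_thread, n_workers):
    """Minutes until the largest chunk finishes at the given per-thread rate."""
    largest_chunk = math.ceil(n_profiles / n_workers)
    return largest_chunk / rate_per_thread

# Usage: a hypothetical 3,000-profile dataset at 4-hour resolution,
# 30 threads, using the lower quoted rate of 3 profiles per minute per thread.
chunks = partition(3000, 30)
minutes = wall_clock_minutes(3000, rate_per_thread=3, n_workers=30)
```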

Although ABSR performs best among the three algorithms for short noisy time series, it is not the best choice for all situations. ABSR is useful for users who would like to maximize the discovery of rhythmic genes with 4-hour temporal resolution data. As the length of the time series increases, however, the number of parameters to be sampled in estimating the posterior probability also increases, so convergence of the estimate could become a concern. For long time series, JTK_CYCLE would be a better choice than ABSR for classifying time-course gene expression profiles. Users will therefore need to choose an algorithm based on their experimental conditions.

The authors declare that they have no competing interests.

The authors would like to thank Professor John Hogenesch for his comments and sharing synthetic and experimental data. This work was supported by the Defense Advanced Research Projects Agency (D12AP00005) and Charles Phelps Taft Research Center. Additional support was provided by Department of Mathematical Sciences at University of Cincinnati.