Temporal gene expression data are of particular interest to researchers as they contain rich information in characterization of gene function and have been widely used in biomedical studies. However, extracting information and identifying efficient treatment effects without loss of temporal information are still in problem. In this paper, we propose a method of classifying temporal gene expression curves in which individual expression trajectory is modeled as longitudinal data with changeable variance and covariance structure. The method, mainly based on generalized mixed model, is illustrated by a dense temporal gene expression data in bacteria. We aimed at evaluating gene effects and treatments. The power and time points of measurements are also characterized via the longitudinal mixed model. The results indicated that the proposed methodology is promising for the analysis of temporal gene expression data, and that it could be generally applicable to other high-throughput temporal gene expression analyses.
1. Introduction
The high-throughput gene expression
techniques, such as oligonucleotide and DNA microarray, serial analysis gene
expression (SAGE) make it possible now to quickly generate huge amount of time
series data on gene expression under various conditions [1–5],
and have been widely applied in biomedical studies. The current temporal gene
expressions usually have several main features: containing large scale of data
set, having many genes, involving many procedure noises, and absenting
statistical confidence, but few measuring time series levels. Using the
difference at two or very few time points to understand changes has also some
fundamental limitations. It tells us nothing about each gene's trajectory, and does
not consider “overall" difference, nor does it allow studying evolution
difference. For these such data with observations at very few time points, the
current widely used analysis methods are various clustering methods, fold expression changes,
ANOVA [6–9], and recently the hidden
Markov chain models (Yuan and Kendziorski 2006). It is simple to interpret the results, and all
the available data are analyzed when these methods are applied. However, there
are problems associated with these methods which include merely qualifying
characteristics of the gene behaviors and clearly absenting quantitative
description, and it may take a risk of having false positive and false negative
when looking strictly at fold change [9, 10]. Some genetic information may
be lost using fold change analysis, and difficulties arise when genes having a
bigger folds change in one expression experiment have different performance in
multiple arrays or different experiments. It is even more problematic when
multiple testing was carried out. For the widely used ANOVA or univariate
method, it only analyzes difference between observed means and treats changes
of individual gene profile as noise. The main limitation is that the data must
be balanced, that is, all measurements occur at same times for all genes, no
distinction between unequally spaced time points and equally spaced time
points. The ANOVA does not produce a parameter that evaluates the rate of
change over time for different treatment groups. Besides, it provides an
oversimplification representation for the mean of a data set. The generalized
linear models are also used in analyzing gene expression data, but they are
based on analyzing the data at each time point separately. They do not take
into account the fact that the gene expression measurements are not independent
and do not address the difference in how the mean changes over time. Both the
“classical” univariate and multivariate procedures assume that covariance
matrix of each data is the same for all measurements at different times,
regardless of group or compound symmetry. This assumption implies a very
pattern of correlation among observations taken on the same unit at different
times which is quite unrealistic for longitudinal data [11]. The other characteristic
shared both by the classical univariate and multivariate methods is that time
itself does not appear explicitly in the model.
By characterizing
the entire pattern of gene expression, and distinguishing the individual gene
profile changes subgroup and population-average profile changes, precise
estimates with good capability and excellent combination of gene and condition
effects were achieved with observations at much more time points. A prospective
cohort study where repeated measures are taken over time for each gene is usually
designed to answer the following two questions. First, how many observation
points are needed over time? Second, how are the variables of interest
including genes and conditions associated with each other over time? Therefore,
the longitudinal observations with enough time points are most appropriate for
the investigation of individual gene changes over time and for the study of
effects of other factors such as experimental conditions. In this paper, we
illustrate the strategy with an example of a 15-gene set in Pseudomonas aeruginosa
expressed in three
conditions and measured at 48 time points. These 15 genes are either quorum-sensing
(QS) genes or quorum sensing regulated genes. Quorum sensing system is a
bacterial gene regulatory system that employs small secreted molecules called
autoinducers as signaling molecules to coordinate gene expression in a
population manner. The autoinducers synthesized and diffused into the growth
medium by individual cells increase in amount when the cell number increases,
and when the concentration of autoinducers reaches a threshold they bind to
cognate transcription regulator to modulate transcriptions of the bacterial
genes. So the cell behaves as a whole. The quorum sensing systems in P.
aeruginosa play a central role in regulating virulence factor expression
and in biofilms formation. It has been reported that the expression of one of
the genes in QS systems, rhlI is regulated by the iron conditions of the growth
medium. However, the extent that this gene is regulated by iron availability is
rather small. It is hard to assess the importance of this effect of iron on the
QS system in P. aeruginosa. Employing the strategy described in this
paper, we are able to determine the definite effect of iron availability using
a relative large dataset which includes 15 genes over 48 time points in three different
conditions totaling 2160 data points.
To analyze such
data of temporal gene expressions, the longitudinal mixed model is used. The linear mixed models are extensions of
linear regression models used to analyze longitudinal (correlated) data. They
accommodate both fixed effects and random effects where the random effects are
used to model between-gene variation and the correlation induced by this
variation. Linear mixed models are extremely flexible analysis tools, which are
especially suitable for unbalanced data with unequally spaced time points and of
emphasis on both individual gene level and population-level components. The
longitudinal mixed model analysis we present provides a strategy to analyze
more complex time series gene expression datasets. The gene expression
longitudinal data is characterized by repeated observations over time on the
same set of genes, and the repeated observations on the same gene tend to be
correlated, therefore, any appropriate statistical analysis must take this
correlation into account. The longitudinal mixed model analysis is useful to
identify general trends within genes over time, to detect nonlinear changes
over time, and also to provide information about the amount of interindividual
gene variability. This analysis incorporates different subgroups on the same
graph to explain interindividual gene variability.
2. Materials and Methods2.1. Gene Expression Data in P. Aeruginosa
The promoter
regions of selected P. aeruginosa virulence factors were amplified by
PCR using oligonucleotide primers synthesized [12] according to the PAO1 genome
data and PAO1 chromosomal DNA as the template. The PCR amplified promoter
regions were then cloned into the XhoI-BamHI sites of pMS402 and
transformed into PAO1 by electroporation. PCR and DNA manipulation and
transformation were performed following general procedures. The
promoterless luxCDABE operon in pMS402 enables the activity of the
promoter fused upstream of the operon to be measured as counts per second (CPS) of light production
in a Victor2 multilabelcounter [12].
TSBDC minimal
medium supplemented with EDTA (400 ug/mL) and 50 ug/mL FeCl3 was used in gene expression
assays (Table 1). Overnight cultures of the reporter strains were diluted 1:200
in a 96-well microtiter plate and the promoteractivity of the
virulence factors in different conditions was measured every 30 minutes for 24
hours. Bacterial growth was monitored at the same time by measuring the optical
density at 620 nm (OD620) in the Victor2 multilabel
counter.
Culture media.
Condition treatments
Description
C1T13
TSBDC
C2T13
TSBDC + 400 ug/mL
EDDA
C3T13
TSBDC + 50 ug/mL FeCl3
2.2. Statistical Methods
To analyze these
longitudinal data of temporal gene expressions, the mixed model Yi=Xiβ+Zibi+εi will be used, where Yi is an (ni×1) vector of expression for the ith gene, i=1,…,m. Xi is an (ni×p) design matrix that characterizes the
systematic part of the gene expression, for example, depending on covariates
and time. β is a (p×1) vector of parameters usually referred to as fixed
effects, that complete the characterization of the systematic part of the gene
expression. Zi is an (ni×k) design matrix that characterizes random
variation in the response attributable to among genes. bi is a (k×1) vector of the random effects variables
that completes the characterization of the among-gene variation. εi is an (ni×1) vector of within-gene errors
characterizing variation due to the way in which the expression levels are
measured on the ith gene.
The data vector Yi
has a multivariate normal distribution
with E(Yi)=Xiβ,var(Yi)=ZiDZi′+Ri=Σi,
and Yi~N(Xiβ,Σi).
Here, the usual assumptions are bi~N(0,D),
D is a (k×k) covariance matrix that characterizes variation
due to among-gene source, and the dimension of D corresponds to the number of
among-gene random effects in model. εi~N(0,Ri), Ri is an (ni×ni) covariance matrix that characterizes variance
and correlation due to within-gene sources. The form of Σi implied by the model has two distinct
components, the first having to do with variation solely from among-gene
sources and the second having to do with variation solely from within-gene
sources. We used maximum likelihood (ML), restricted maximum likelihood (REML),
and minimum variance quadratic unbiased estimation (MIVQUE0) to estimate the
covariance parameters of the G and R, respectively.
In order to check
the influence of temporal measurements for longitudinal mixed analysis, we further
constructed a dataset of the same dimension and with the same covariates and
factor values for which power is to be calculated. With F-test statistics, we
calculated noncentrality parameter (ϕ) and degrees of freedom ν1 and ν2,
then power is calculated as P(Fν1,ν2,0>FC), FC is a critical value. All analyses were
implemented by SAS package.
3. Results and
Analysis3.1. The Trajectories of the Longitudinal Gene Expression Data
To validate the
models for our data set, we plotted the expression profiles for all genes under
different conditions. The trajectories of the 15-gene set are shown in Figure 1. From the figure, we can see that there is
high degree of variations between genes. There are also correlation genes at
different time points, and the correlation structure cannot be ignored in
analysis. The expression trajectories of the genes change over time for all of
the genes, and at a certain time point, the change rate for each gene is
different from other time point and from that of other genes. From Figure 2, we can see that the
trajectories of experimental treats are also changing over time, and the change
rate varies from conditions.
The trajectories of the 15 gene-set in C1T13: TSBDC condition.
The trajectories of one gene in 3 conditions. Control: TSBDC, Condition A: TSBDC + 50 ug/mL FeCl3 Condition B: TSBDC + 400 ug/mL EDDA.
3.2. Choice of and Assessing the Goodness-of-Fit Covariance Structure
In the
longitudinal data, there are three sources of error in the residual, including
serial correlation, measurement error,
and random component. In order to use longitudinal mixed-model methodology, it
is assumed that the data has a linear mean and a reasonable covariance structure. The reasonable covariance
structure is a parsimonious covariance just enough to be estimated with
available current data and yet rich to capture probable covariance between gene
expression observations. The fitting information shown in Table 2 provides some
statistics about the estimated mixed model. The log likelihood supplies the
estimation information of covariance G and R in the mixed models. Akaike’s information criteria (AIC) can be used to compare
models with the same fixed effects but different variance structures. Models
having the smallest AIC are deemed the best. The Schwarz Bayesian criteria (BIC)
are also computed, and models with smaller BIC are also preferred. The six
models with different covariance structure were fitted, and preference was
selected based on the AIC and BIC values. Inspection of AIC and BIC values for
each of the six models revealed that the values of both the AIC and BIC in the
assumed same covariance structure are larger than those of the assumed
different ones. Both criteria are the smallest for the chosen separate spatial
power (SP) structures for each treatment. The values of both AIC and BIC in SP are the minimum among the
models. The log likelihood of the model is also the best for separate SP
structures. As both criteria agree, it would be sensible to choose the model to
represent the covariance structure that has different variance and covariance
in different treatments. Interestingly, we found there were the almost same
values AIC, BIC, and likelihood value between GLM and CS model, which indicated
that univariate GLM calculations are identifical to MIXED estimates when using
CS for the balanced data sets. The multivariate GLM cannot determine best fit
when the data set is a longitudinal data.
Covariance structures using ML.
Model
Description
AIC
BIC
−2log likelihood
1
General
linear model (GLM)
1811.8
1856.2
1798.8
2
Compound
symmetry (CS)
1811.5
1856.0
1796.7
3
Variance
components (VC)
1665.0
1651.3
1645.0
4
Heterogeneous
CS (CSH)
1636.8
1618.0
1610.8
5
Spatial
power (SP)
1689.2
1685.6
1600.2
AIC:
Akaike’s information criteria; BIC: Bayesian information criteria for each
model selected.
3.3. Power and Sample Size Determination for Longitudinal Mixed Model
In statistical analysis,
one typically expresses the belief that some effects exist in a population by specifying an
alternative hypothesis to H1,
a null hypothesis H0 as the assertion that effect does not exist
and attempt to gather evidence to reject H0 in favor of H1.
If H0 is rejected but there is really no effect,
this is called a Type I error, which is usually designated α;
if there really is an effect in the population but H0 is not rejected, then a Type II error has been
made, which is usually designated β.
The probability 1−β of avoiding a Type II error, that is,
correctly rejecting H0 and achieving statistical significance, is
called the power. We simulated our data structure and calculated the power of
estimating condition effects via the longitudinal mixed model. As shown in
Figure 3, we found the model can get maximum power while more than 7 or 8
measurements were taken. So the 48 temporal measurements of each gene in our
research could have enough power to obtain the estimation of treatments and
gene effects.
Power analysis under the longitudinal mixed model with heterogeneous
compound symmetry variance structure.
3.4. Estimation of the Effect of Iron Condition on QS Genes by the Mixed Model
We adopted the longitudinal mixed model with heterogeneous compound symmetry variance to estimate
the effects of iron condition on QS genes expression. From Figure 2, the
effects of the culture media TDBDC and TSBDC + 400 ug/mL EDDA are almost equal and
higher than that of TDBDC + FeCl3. Comparing with the TSBDC, the
addition of TSBDC + 50 ug/mL FeCl3 positively regulates the expression of these
genes as shown in Figure 4. To check the detailed differences of the genes, the
longitudinal mixed model was used to estimate the gene effects, as shown in
Figure 5. We found that most of genes, including FliC, LasR, PKD202, PKD203,
and PhlR, demonstrate positive expression effects in condition of addition of
400 ug/mL EDDA, whereas PhlA shows opposite expression effect.
The estimation of condition effects. Condition A: TSBDC + 50 ug/mL FeCl3,Condition B: TSBDC + 400 ug/mL EDDA.
The estimation of gene effects under condition TSBDC + 50 ug/mL FeCl3.
4. Discussion
The
identification of genes that show changes in expression between varying
biological conditions is a frequent goal in microarray experiments. Under
different
biological conditions, the patterns of gene expressions
may be various. To obtain efficient information for temporal gene expression,
the number of longitudinal observations should be enough for individual gene
changes over time and the study of effects with biological conditions.
In
longitudinal studies, time effect is the changes over time for each gene, and
cohort effect is the 22 differences among genes in their baseline values.
Longitudinal studies can distinguish these time and cohort effects while
cross-sectional studies cannot. In this paper, we have considered mixed model
with longitudinal covariates, the analysis of longitudinal data should take
into account firstly, the within-subject
correlation, secondly the measurements taken at unequal time intervals and finally the missing
observations. Repeated measures analysis of variance can be used to analyze
longitudinal or repeated measures data for balanced study design, that is, when all genes are measured at equal time
points and there are no missing data. In large scale of gene expression
analysis, if having unbalanced datasets in longitudinal studies, it is
necessary to use some alternative techniques which can handle unbalanced data.
In this research, we confirmed that univariate GLM calculations are identifical
to MIXED estimates when using CS for
balanced data sets. The multivariate GLM cannot determine best fit when the
data set is a longitudinal data. Therefore, the procedures of best fit mixed
model include: (1) the choice of the
model, (2) the choice of the variance-covariance structure (specifying
the working correlation structure for
each gene, e.g., independence, exchangeable, stationary, and autoregressive), (3) assessing the
goodness-of-fit of the model, and (4) assessing the goodness-of-fit of the variance covariance
structure.
Although the paper only analyzed
the effects of three treatments and 15-gene effects, it proved that the longitudinal mixed model is
a feasible method in dense temporal gene expression analysis. We found
that the addition of TSBDC + 50 ug/mL FeCl3 positively regulates the expression
of these genes in our analysis. It has been reported that iron availability in
the growth condition affects the expression of genes. However, the changes of expression are rather
small. It is thus difficult to assess whether there is a pronounced effect of
iron on the QS genes. Accordingly the
current analysis method, using the mixed model described aforementioned a
definite effect could be determined. A
comprehensive understanding of biological processes requires the acquisition of
expression data at different developmental stages, in different tissues and
different treatment conditions with different organisms. The addition of time
as a variable allows observation of the modulation of gene expression whether
due to the regulation of development or the changing impact of a treatment
condition. The expectation is that high-throughput gene expression analysis
conducted in the higher dimensions of genes, conditions, tissues, and time as
variables will help elucidate what the genes do, when, where, and how they are
expressed as elements of an orchestrated system under the effects of
perturbations and developmental processes, and we will explore the possibility
of generalized mixed model in higher dimensions expression data [13–16].
YuanM.myuan@isye.gatech.eduLinY.yilin@stat.wisc.eduModel selection and estimation in the Gaussian graphical model2007941193510.1093/biomet/asm018ZhuH.hua-zhu@omrf.ouhsc.eduTangY.yuhong-tang@omrf.ouhsc.eduIvanciuL.ivanciul@omrf.ouhsc.eduTemporal dynamics of gene expression in the lung in a baboon model of E. coli sepsis20078, article 5812310.1186/1471-2164-8-58ChoR. J.CampbellM. J.campbell@mag.comWinzelerE. A.A genome-wide transcriptional analysis of the mitotic cell cycle199821657310.1016/S1097-2765(00)80114-8SpellmanP. T.SherlockG.ZhangM. Q.Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization199891232733297BjarnasonJ.SouthwardC. M.SuretteM. G.surette@ucalgary.caGenomic profiling of iron-responsive genes in Salmonella enterica serovar Typhimurium by high-throughput screening of a random promoter library2003185164973498210.1128/JB.185.16.4973-4982.2003EisenM. B.SpellmanP. T.BrownP. O.BotsteinD.botstein@genome.stanford.eduCluster analysis and display of genome-wide expression patterns19989525148631486810.1073/pnas.95.25.14863YeungK. Y.RuzzoW. L.Principal component analysis for clustering gene expression data200117976377410.1093/bioinformatics/17.9.763LiH.LuanY.HongF.LiY.Statistical methods for analysis of time course gene expression data20027a90a9810.2741/liDraghiciS.sod@cs.wayne.eduKulaevaO.HoffB.PetrovA.ShamsS.TainskyM. A.Noise sampling method: an ANOVA approach allowing robust selection of differentially regulated genes measured by DNA microarrays200319111348135910.1093/bioinformatics/btg165TanakaT. S.JaradatS. A.LimM. K.Genome-wide expression profiling of mid-gestation placenta and embryo using a 15,000 mouse developmental cDNA microarray200097169127913210.1073/pnas.97.16.9127ZegerS. L.LiangK.-Y.An overview of methods for the analysis of longitudinal data19921114-151825183910.1002/sim.4780111406DuanK.DammelC.SteinJ.RabinH.SuretteM. G.surette@ucalgary.caModulation of Pseudomonas aeruginosa gene expression by host microflora through interspecies communication20035051477149110.1046/j.1365-2958.2003.03803.xSongJ. Z.DuanK. M.WareT.SuretteM.The wavelet-based cluster analysis for temporal gene expression data2007200773938210.1155/2007/39382WesterhoffH. V.MosekildeE.NoeC. R.ClemensenA. M.am.clemensen@fysik.dtu.dkIntegrating systems approaches into pharmaceutical sciences2008351-21410.1016/j.ejps.2008.05.011WesterhoffH. V.hans.westerhoff@manchester.ac.ukKolodkinA.ConradieR.Systems biology towards life in silico: mathematics of the control of living cells2009581-273410.1007/s00285-008-0160-8AndroulakisI. P.yannis@rci.rutgers.eduYangE.eyang@eden.rutgers.eduAlmonR. R.almon@eng.buffalo.eduAnalysis of time-series gene expression data: methods, challenges, and opportunities2007920522810.1146/annurev.bioeng.9.060906.151904