Abstract

Organisms usually cope with change in the environment by altering the dynamic trajectory of gene expression to adjust the complement of active proteins. The identification of particular sets of genes whose expression is adaptive in response to environmental changes helps to understand the mechanistic base of gene–environment interactions essential for organismic development. We describe a computational framework for clustering the dynamics of gene expression in distinct environments through Gaussian mixture fitting to the expression data measured at a set of discrete time points. We outline a number of quantitative testable hypotheses about the patterns of dynamic gene expression in changing environments and gene–environment interactions causing developmental differentiation. The future directions of gene clustering in terms of incorporations of the latest biological discoveries and statistical innovations are discussed. We provide a set of computational tools that are applicable to modeling and analysis of dynamic gene expression data measured in multiple environments.

INTRODUCTION

The development of high-throughput technologies, such as DNA microarrays and proteomics platforms, has made it possible to ask and address many fundamental but difficult questions in developmental biology and biomedicine. For example, variation in the pattern of gene expression may reflect unique physiological or pathological properties of individual cells, organs or organisms that cannot be observed readily or directly [1, 2]. By screening approximately 1000 proteins in individual cancer cells, Cohen et al. [3] detected a subset of proteins whose expression displays different dynamic patterns between seemingly identical cancer cells that actually have different fates. To identify distinct patterns of gene expression dynamics from a flood of microarray data, powerful computational tools for clustering genes or proteins based on their dynamic profiles have become essential. In the past decade, enormous efforts have been made to develop computational methods for cataloguing gene expression dynamics and use these distinct patterns to assess developmental functions and mechanisms of biological phenomena [4–15]. There is also a pressing need for computational approaches to cluster analysis of dynamic gene expression that interacts with the environment, because the activation and expression of many genes is environment-contingent [16]. However, to our best knowledge, such approaches have not been systematically described in the literature, obstructing geneticists to analyze time series data in a way that can detect complex effects on expression patterns.

In biology, there is a widespread phenomenon that organisms cope with biotic and abiotic environments by controlling gene expression to harness the complement of active proteins [17–19]. When the environment alternates between discrete states, organisms will stimulate their regulatory system through adjusting gene transcription rates to best adapt to the environment. For this reason, by detecting the difference in the pattern of gene expression trajectories between discrete environments, we will be in a better position to study interactions between genes and environments and dictate a comprehensive map of gene–environment relationships. Traditional approaches for studying gene–environment interactions are based on quantitative trait locus (QTL) mapping usually with experimental crosses. Significant gene–environment interactions are identified if specific QTLs are detected, through statistical tests, to display different effects between environments [20, 21]. More recently, gene transcript abundance has been used to study gene–environment interactions in many organisms such as yeast [16, 22] and worms [23], but all these studies are limited to gene expression data measured at single time points of development.

The purpose of this article is to describe a general framework for identifying environment-specific clusters of gene expression. The use of gene expression dynamics to understand gene–environment interactions is highly informative because of its capacity to identify development-related genes. However, this is a challenging work in terms of gene clustering across two-dimensional spaces, time and environmental state. The framework described in this article integrates developmental and environment-dependent programs of gene expression. Mathematical aspects of gene expression dynamics are implemented into a mixture model setting by considering the impact of environment on gene expression. The patterns of gene expression related to specific physiological functions can be parsimoniously modeled using a set of mathematical parameters [14, 24]. Thus, by estimating the parameters that determine mathematical functions, the pattern of how genes change their level of expression over time and environment can be estimated and tested. The results from these models, therefore, can better be interpreted in a biologically sensible way. In addition, the framework considers the intrinsic structure of time-dependent correlations based on an optimal statistical process, which increases the power of detecting significantly differentiated patterns. To demonstrate its usefulness and utilization in practice, we use this framework to analyze a real data set from a rabbit hemodynamic study in which gene expression is observed in two distinct blood flow environments [25, 26]. We evaluated the advantages of this tool by performing simulation studies. The simulation results illustrated that the tool has favorable statistical properties and can be used in any environment-dependent gene expression data.

GENE-CLUSTERING FRAMEWORK

Statistical model

Suppose there are n genes each measured at T time points in L environments. Let formula denote the gene expression data for gene i in environment l. Combining all the environments, we have formula. If these genes are grouped into J clusters, it means that any one of genes (i ) is assumed to arise from one (and only one) of the J possible clusters. Thus, the phenotypic value of gene i expressed at time formula in environment l is written as
(1)
where formula is an indicator for gene i, defined as 1 if this gene belongs to cluster j and 0 otherwise, formula is the mean of all genes belonging to cluster j at at time formula in environment l, formula is the value of covariate c (formula) for gene i, formula is the effect of covariate c, and formula is the residual assumed to follow a Gaussian distribution with mean zero and variance formula. For longitudinal data, residual errors at different time points may be correlated with covariance formula (formulaformula At least one of the inequalities, formulaformula holds). The residual variances and covariance comprise a (formula) covariance matrix formula.
The distribution of gene expression data is expressed as the J-component mixture probability density function, i.e.
(2)
where formula is a vector of mixture proportions which are nonnegative and sum to unity; formula contains the mean vector of cluster j; and formula contains residual variances and covariances among T time points over L environments which are common for all clusters. The probability density function of cluster j, formula, is assumed to be multivariate normally distributed with TL-dimensional mean vector
(3)
and covariance matrix formula. Notice that formula contains gene-specific covariate effects.
The likelihood based on a mixture model containing J clusters can be written as
(4)
where formula is a vector of unknown parameters including the mixture proportions, cluster-specific mean vectors and covariance.

Different from traditional treatments, we will incorporate mathematical and statistical models to fit the mean-covariance structures. Instead of estimating all elements in the vectors and covariance, we estimate the mathematical and statistical parameters that model the mean-covariance structures. Thus, the question of clustering gene expression dynamics becomes how to find a set of parameters (arrayed in formula) that models cluster-specific expression profiles in a biologically and statistically meaningful way and to find a set of parameters (arrayed in formula) that models the covariance structure both parsimoniously and flexibly.

Structural modeling of mean vectors

Since the transcript levels of DNA microarrays generally vary in a time course, we may use mathematical and statistical models to approach their dynamic changes. Below is a list of approaches for modeling time-dependent gene expression profiles:

Parametric modeling

For clock gene cases, the amount of mRNAs within the cell division cycle changes periodically, coincident with the cell cycle, which helps to maintain proper order during cell division or to conserve limited resources. The oscillation of cell cycle-regulated genes can be mathematically described by periodic Fourier functions or other functions [27–32]. The Fourier function can be approximated by its first K terms, expressed as
The coefficients formula and formula determine the times at which the expression level achieves maximums and minimums, formula is the average expression level of the gene and formula specifies the periodicity of the regulation. From Equation (3), the mean expression value of gene cluster j at time formula in environment l is expressed as
where formula denotes the vector of Fourier parameters of the first K orders. Thus, by estimating the parameters that define the periodic curves for individual clusters, we can determine the differences in the temporal pattern of gene expression [14].

There are many other biologically well-justified curves that can be used to model gene expression dynamics. These include sigmoid equations for gene expression related to biological growth [33–37], triple-logistic equations for gene expression related to human body growth [38], bi-exponential equations for gene expression related to HIV dynamics [39], sigmoid Emax models for gene expression related to pharmacodynamic response [40], biological thermal dynamics [41], aerodynamic power curves for gene expression related to bird flight [42, 43], hyperbolic curves for gene expression related to photosynthetic reaction [44], etc.

Nonparametric modeling

If time-varying expression of genes does not obey an explicit mathematical function, nonparametric approaches, such as kernel estimators or B-splines, can be used [9, 45, 46]. Kernel estimators are based on local polynomial regression, whereas smoothing splines use a piece-wise polynomial function. As shown by Silverman [47], kernel smoothing and smoothing-spline smoothing are asymptotically equivalent for independent data and splines are higher-order kernels. More recently, Legendre orthogonal polynomials (LOP) have been used to model dynamic changes of complex traits that cannot be explicitly described by a mathematical equation [48–51]. Since the LOP are orthogonal to each other and integrate to 0 in the interval [−1, 1], nonparametric estimates derived from this approach display favorable asymptotic properties [52, 53]. The LOP have been successfully used to model time-varying phenotypic or genetic changes for many complex traits, such as milk production [54] and plant height growth [48, 49]. As will be seen from an example below, it should be equally useful for modeling the dynamic pattern of gene expression profiles in a time course.

Semiparametric modeling

If gene expression spans multiple distinct stages (see [13]), at some of which the expression values follow a parametric form but at others of which they do not, we can implement a semiparametric model that combines the parsimony and biological relevance of parametric approaches with the flexibility of nonparametric approaches. As explained by Cui et al. [49], such a semiparametric approach was used to model the growth process and death process of tiller number in a lifetime of rice. A similar semiparametric approach can also be implemented in the clustering framework of multistage gene expression dynamics. This will enable us to study dynamic changes of gene expression by relating its temporal profiles from different developmental stages.

Structural modeling of covariance

Unstructured estimates of a longitudinal covariance matrix may be highly unstable for large matrices. This, in conjunction with the fact that the covariance among repeated measures over time has an inherent structure [55], implies that structuring a covariance matrix with few parameters may be crucial for parsimonious and efficient parameter estimates in dynamic gene clustering. An extreme of covariance structuring is compound symmetry and autoregression of order one, but this may be far from the true covariance, leading to severe bias. The best covariance estimator should be at the balance between its variance and bias. Below, we list several commonly used approaches for covariance structure.

Autoregressive moving-average process(p,q)

The autoregressive moving-average process, ARMA( p, q) [56], is flexible to provide a robust estimate of gene expression covariance structure [38]. The zero-mean residual error formula in environment l (1) is generated according to the following process
(5)
where formula and formula are unknown parameters and formula is a sequence of independent and identically distributed normal random variables with zero mean and variance formula. The ARMA(p,q) model parameters are arrayed in formulaformula. The merit of the ARMA model includes the existence of closed forms for the estimates of the inverse and determinant of the structured covariance matrix [57, 58], which enhances computational efficiency.

By various constraints, the ARMA model can be reduced to a simple autoregressive (AR) model and structured antedependence (SAD) model [59]. Both the first-order AR and first-order SAD models use only two parameters, but the latter is more flexible than the former since the latter allows the variance and correlation to change over time. The SAD model has been successfully incorporated in functional mapping of dynamic complex traits [60]. For the ARMA model, it is important to determine its optimal order to model covariance structure. A model selection procedure based on penalized likelihood criteria, such as AIC and BIC, can be established to determine an optimal approach.

Kernel smoothing

The kernel smoothing method has been used to estimate longitudinal covariances [61]. The advantage of this method lies in its flexibility to specify any form of covariances and asymptotic properties. Under the homogeneous assumption, the covariance of gene expression between any two time points formula and formula in environment l is written as a function of time interval, i.e. Covformula. Kernel smoothing describes this covariance by
(6)
where n is the total number of genes, T is the total number of time points, formula is a kernel function, h is a bandwidth and formula is the mean of gene i, given by formula. One of the mostly used kernel functions is Gaussian kernel, i.e. formula. A variety of statistical methods have been developed to choose an optimal kernel and appropriate bandwidth (see [61]).

Modeling covariance over time and environment

In this article, we consider gene expression dynamics in multiple environments. The approaches described above are used to model time-dependent covariances of gene expression separately for each environment. These approaches can be similarly used to model covariances over different environments at each time point. The covariance structure over time and environment is then modeled by taking the product of purely temporal and environmental covariances. This so-called separable approach is simple but has many undesirable properties since it does not allow environment–time interactions. We can implement a nonseparable stationary model (see [62–64]) to structure time-environment covariance of gene expression. A nonseparable covariance is not expressed as a Kronecker product of two matrices like separable structures. An advantage of covariance modeling in this context is in providing a better characterization of the random process to obtain optimal kriging or prediction of unobserved portions of it. In functional mapping of dynamic complex traits, Yap et al. [65] has successfully incorporated Cressie and Huang's [62] nonseparable model to estimate the covariance of photosynthetic rate over temperature and irradiance. A similar idea can be developed to use this nonseparable model to fit the spatiotemporal covariance of gene expression.

Estimation and tests

A hybrid EM-simplex algorithm was implemented to estimate the parameters, formula, contained in the likelihood (4). The EM algorithm provides a platform for estimating the proportions of different gene clusters, within which the simplex algorithm is embedded to estimate base vectors for each cluster and the covariance-structuring parameters. This can be described as follows:

In the E step, we define and estimate the posterior probabilities of gene i, with which it belongs to a particular expression pattern j, by
(7)
In the M step, the proportion of expression pattern j is calculated by
(8)
Mean-covariance structuring parameters in formula and formula are estimated in this step, but no closed forms can be derived for their estimators. The EM algorithm was derived to estimate these parameters when gene expression dynamics is modeled by Fourier series approximation and the covariance modeled by the ARMA process [14, 38]. In general, the simplex algorithm, which does not depend on explicit equations, can be implemented to estimate these structuring parameters.
The framework for clustering gene expression dynamics over multiple environments allows geneticists to address many biologically meaningful questions. First, an optimal number of gene clusters in terms of their different expression dynamics over all environments can be determined using AIC or BIC approaches (see the example shown below). Second, we need to determine the optimal number of gene clusters in a specific environment. For two given patterns j and j', they may be identical in an environment l, although different for all the L environments. This can be tested by
(9)
The H0 states that gene expression over an entire time course is identical for the two gene clusters j and j ′. If the H1 is rejected, it means that the optimal number of clusters in environment l is equal to, or less than, formula. By performing this pairwise test for all gene clusters, this approach allows the identification of the optimal number of expression patterns for environment l.
Third, we can test the significance of gene–environment interactions. This can be done by testing
(10)
The H0 states that the gene expression of a cluster over an entire time course is identical between the two environments l and l ′. If the H0 is rejected, it means that the expression of gene cluster j displays significant gene–environment interactions. This test provides a quantitative way to study the relationship between genes and environments.
Fourth, we are interested in testing how genes interact with time in a specific environment. For two given clusters j and j′, this test is expressed as
(11)
where c is a constant. The H0 states that time-dependent gene expression of cluster j ′ can be expressed as a linear function of that of cluster j in environment l. If the H0 is rejected, it means that this pair of gene clusters has a significant gene–time interaction under this particular environment.

WORKED EXAMPLE

Data analysis

The new tool is demonstrated by analyzing a data set of microarray genes associated with response to vein bypass grafting designed to treat arterial occlusive disease. The data were obtained using a rabbit bilateral vein graft construct, as previously described by Jiang et al. [25]. New Zealand White rabbits (weighing 3.0–3.5 kg) were treated by bilateral jugular vein interposition grafting and unilateral distal carotid artery branch ligation to create two distinct flows, i.e. two different environments. Through ligation of the internal carotid and three of the four primary branches of the external carotid artery, an immediate 6-fold difference in blood flow between the right and left vein grafts was obtained. A segment of the vein was retained at the time of implantation for baseline morphometric measurements. Vein grafts were harvested at 1, 3, 7, 14, 28, 90 and 180 days after implantation. Expression of 14 958 genes was recorded for each of these time points under both treatments, high flow and low flow. Other parameters related to hemodynamic behavior, such as graft flow rate, intraluminal pressure, mean circumferential wall stress and shear stress, were also measured or estimated.

An initial step is the selection of an appropriate model that fits the dynamic change of gene expression over time. Figure 1 shows the plotting of 10 randomly selected genes expressed over time in the two treatments, from which we found that there is a great variability in gene expression trajectories. Some are curvaceous, while others are quite flat. Thus, we used a flexible nonparametric approach based on LOP to model gene expression dynamics. Let formula denote a family of LOP with a particular order r derived from a special differential equation, where formula is a scaled time with a range formula. Let formula denote a vector of base values for cluster j in environment l. Then, time-varying mean values for cluster j in environment l in Equation (2) can be expressed as a linear combination of formula weighted by the family of LOP, i.e.
Our task now is to estimate the base vector formula from the given data. The variance of expression among genes seems to be broadly consistent over time points, suggesting that the first-order AR model may fit the data. To combine the expression data from the two flows, we used a separable model to structure the covariance over time and environment. In Figure 2, a plot of BIC values is illustrated against varying numbers of gene clusters under different LOP orders, from which we chose eight clusters and three orders that provide a best combination for curve fitting. Implementing this combination, we estimated the expression trajectories of each of the eight gene clusters for both high and low flows. We need to detect if any of these eight clusters, labeled from A to H (see Supplementary Figure S1), overlap in a flow. Pairwise tests using hypothesis test (9) indicate that no pair of clusters is identical for expression trajectories in each flow (P < 0.0001). This result suggests that the optimal number of clusters should be eight for both flows. Table 1 gives the estimated proportions and their standard errors of each cluster from these genes.
Trajectories of expression for ten genes randomly chosen from those associated with response to vein bypass grafting in rabbits in two treatments, high flow and low flow.
Figure 1:

Trajectories of expression for ten genes randomly chosen from those associated with response to vein bypass grafting in rabbits in two treatments, high flow and low flow.

Plot of BIC values calculated for expression trajectories of different gene clusters over cluster number and LOP order.
Figure 2:

Plot of BIC values calculated for expression trajectories of different gene clusters over cluster number and LOP order.

Table 1:

Estimated proportions of gene clusters and standard errors (in parentheses) estimated by resampling for 14 958 genes associated with response to vein bypass grafting in rabbits under two different treatments, high flow and low flow. The significance of gene–environment interactions for each cluster is also given

ClusterABCDEFGH
Proportion0.0116 (0.0019)0.0123 (0.0068)0.3354 (0.0056)0.3831 (0.0075)0.1134 (0.0062)0.0359 (0.0033)0.0100 (0.0012)0.0083 (0.0010)
Gene–environment interaction test
P-value<0.0001>0.100>0.250<0.0001>0.400<0.0001<0.001<0.0001
ClusterABCDEFGH
Proportion0.0116 (0.0019)0.0123 (0.0068)0.3354 (0.0056)0.3831 (0.0075)0.1134 (0.0062)0.0359 (0.0033)0.0100 (0.0012)0.0083 (0.0010)
Gene–environment interaction test
P-value<0.0001>0.100>0.250<0.0001>0.400<0.0001<0.001<0.0001
Table 1:

Estimated proportions of gene clusters and standard errors (in parentheses) estimated by resampling for 14 958 genes associated with response to vein bypass grafting in rabbits under two different treatments, high flow and low flow. The significance of gene–environment interactions for each cluster is also given

ClusterABCDEFGH
Proportion0.0116 (0.0019)0.0123 (0.0068)0.3354 (0.0056)0.3831 (0.0075)0.1134 (0.0062)0.0359 (0.0033)0.0100 (0.0012)0.0083 (0.0010)
Gene–environment interaction test
P-value<0.0001>0.100>0.250<0.0001>0.400<0.0001<0.001<0.0001
ClusterABCDEFGH
Proportion0.0116 (0.0019)0.0123 (0.0068)0.3354 (0.0056)0.3831 (0.0075)0.1134 (0.0062)0.0359 (0.0033)0.0100 (0.0012)0.0083 (0.0010)
Gene–environment interaction test
P-value<0.0001>0.100>0.250<0.0001>0.400<0.0001<0.001<0.0001

By calculating the posterior probabilities of each gene that belongs to different clusters using Equation (7), we can determine the most likely cluster of this gene. Thus, we can draw gene expression trajectories for all genes that belong to a particular cluster A–H, separately for the two flows and the mean trajectory of the cluster for each flow using the estimates of curve parameters (Figure 3). The 95% confidence intervals of each estimated trajectory are generally within the variation of temporal gene expression profiles among individual genes, suggesting that our estimates are reasonably accurate. In general, most individual gene expression trajectories display similar time-varying trends between the two flows, but marked discrepancies in expression trajectories were detected for some particular clusters.

Mean expression trajectories (thick purple curve) of each of the eight gene clusters, labeled (A)–(H), over time under high flow and low flow. The two broken curves in each plot are the 95% confidence intervals of each estimated mean trajectory. Expression trajectories of all those genes that belong to a particular cluster, determined by the posterior probabilities, are shown in light blue.
Figure 3:

Mean expression trajectories (thick purple curve) of each of the eight gene clusters, labeled (A)–(H), over time under high flow and low flow. The two broken curves in each plot are the 95% confidence intervals of each estimated mean trajectory. Expression trajectories of all those genes that belong to a particular cluster, determined by the posterior probabilities, are shown in light blue.

Hypothesis test (10) allows the detection of gene–environment interactions in expression profiles over high and low flows. Table 1 gives the results of significance tests for gene–environment interactions. Except for clusters B, C and E, all other clusters are expressed differently between high and low flows. To better show treatment-dependent expression differences, we draw the mean trajectories of each cluster from high and low flows in the same plot (see Supplementary Figure S2). The expression of cluster A has the highest level right after implantation, decreases drastically within 25–35 days and then increases gradually. This cluster displays a more pronounced change of expression over time in high flow than low flow. Cluster B has a similar trend of gene expression profiles although its time-varying change is milder compared to cluster A. It appears that clusters C and D have a minimum level of expression throughout experimental time. Clusters E, F and H are expressed at a low level in the beginning of treatment, reach a peak at Day 50 after implantation. Cluster H changes its expression level over time most abruptly, followed by clusters F and E. The expression of cluster G increases over time monotonously until Days 100–120, after which its expression decreases although it is more striking in low flow than high flow.

Simulation

We performed simulation studies to examine the statistical properties of the clustering model. The expression data were simulated by mimicking the structure of the real data analyzed above. A total of 4800 genes were assumed to include eight different clusters in two environments specified by mean trajectories as shown in Figure 3. These genes each have an expression trajectory over four time points as the sum of the mean trajectory of the underlying cluster and residual errors whose covariance structure follows the first-order AR model, but assuming the value of variance that triples the estimated variance. The proportions of eight clusters in Table 1 were used to simulate gene expression profiles.

Simulated data were analyzed by the model. Based on BIC values, the model can detect the correct number of clusters and the correct order of LOP. The model estimates the proportions of clusters A–H precisely. Also, expression trajectories of each gene cluster can be reasonably well estimated (Figure 4), despite a tripled variance used. This shows that the results from the real data set analyzed by the model are convincing from a statistical point of view.

Comparison of estimated (solid) and true (dot) mean expression trajectories for each of the eight clusters, labeled (A)–(H), under two hypothesized treatments, high flow and low flow. The simulated data mimicked the structure of the rabbit data, but assuming an error variance that triples the estimated variance. The 95% confidence intervals (broken curves) of each estimated mean trajectory are given. Expression trajectories of all those genes that belong to a particular cluster are shown in light blue.
Figure 4:

Comparison of estimated (solid) and true (dot) mean expression trajectories for each of the eight clusters, labeled (A)–(H), under two hypothesized treatments, high flow and low flow. The simulated data mimicked the structure of the rabbit data, but assuming an error variance that triples the estimated variance. The 95% confidence intervals (broken curves) of each estimated mean trajectory are given. Expression trajectories of all those genes that belong to a particular cluster are shown in light blue.

An additional simulation was conducted to test the power of the model to detect gene–environment interactions and its false positive rates (FPR). Consider clusters A, D, F, G and H obtained from rabbit gene expression data which display a certain level of gene–environment interactions. By repeating the simulation and estimation procedure 100 times, we detected the numbers of cases in which hypothesis tests for gene–environment interactions using test (10) are significant, are 85–97 for clusters A, D, F, G and H, respectively. To test the FPR, we used the same parameters for clusters B, C and E to simulate gene expression data for high and low flows. Of 100 simulation replicates, less than 5 were detected to be significant. This suggests that the FPR of our tool is acceptably low.

DISCUSSION

There is a pressing need for computational tools that can unravel the developmental machinery of time-dependent gene expression profiles, despite a vast body of literature presenting these tools [4–6, 8, 10–12]. One significant lack is the unavailability of models for analyzing gene–environment interactions for gene expression dynamics. A few pioneering studies have demonstrated the capacity of gene transcript abundance to comprehend the genetic architecture of gene–environment interactions [16, 22, 23].

In this article, we describe a computational tool that can test gene–environment interactions on a genomic level using dynamic gene expression data. The developmental dynamics of cells, organs or organisms are related with many fundamental phenomena in biology, such as growth and phenotypic plasticity. Our understanding of how developmental dynamics is regulated through a balance of gene and environment helps to reveal the mechanistic origins of these phenomena. The new tool may provide an important means for analyzing temporal expression data and construct a map of gene–environment interactions. As an example, we used Legendre-based nonparametric fitting to model dynamic changes of gene expression within a mixture-model framework. One advantage of the Legrendre approach lies in its flexibility for curve fitting [52], computational efficiency and avoidance of knot choice essential for B-splines. For gene expression data with explicit curves, such as periodic transcriptional profiles, robust mathematical equations with better biological relevance and better parsimony can be used.

Gene–environment interactions are important to understand biology and can now be tested in a quantitative way by the tool presented. By applying it to a real data set of microarray genes associated with response to vein bypass grafting between high and low flows in rabbits, this tool identified eight distinct patterns of expression trajectories in a time course. Each of these patterns may be related with a particular biological function operational in hemodynamic processes. Although many patterns display a similar trend of time-varying expression between high and low blood flows, many of them are essentially different based on hypothesis tests by our tool. A further molecular study may help identify specific biochemical pathways related to each of these different gene expression patterns. In a recent study, Cohen et al. [3] detected the discrepancy of dynamic trajectories for a few proteins, which corresponds to cell death or survival, between seemingly identical cells. The tool presented should gain more quantitative insights into the distinction of seemingly trivial differences in genomic and proteomic dynamic studies.

While modeling gene expression dynamics jointly over time and environment in the real example, we assume no interactions between time and environment. Although it facilitates our modeling and computing, this assumption may not be realistic in some cases. A general model should consider the dependence of gene expression profiles in different times and environments by nonseparable covariance structuring approaches (see [62–65] for examples). Different approaches for covariance structure based on nonparametric or semiparametric models should be incorporated and compared using penalized likelihood criteria. In our software package for environment-dependent functional clustering, we implement many of these approaches for users to choose the most parsimonious one given their particular data sets. The computer code for the tool developed is available at Penn State Center for Statistical Genetics web site, http://statgen.psu.edu.

SUPPLEMENTARY DATA

Supplementary data are available online at http://bib.oxfordjournals.org/.

Key Points

  • An organism adapts to the environment where it grows through the alteration of gene expression dynamics.

  • The identification of dynamic gene expression patterns in response to environmental change can facilitate our understanding of the mechanistic base of gene–environment interactions essential for organismic development.

  • This article presents a comprehensive computational framework for clustering gene expression dynamics over a range of environments. The framework model is validated through simulation and real data analysis.

  • Our goal is to provide interested researchers with the computational tools that are applicable to modeling and analysis of dynamic gene expression data derived from multiple environments.

FUNDING

NIH/NHLBI (grant R01 HL095508); NIDA/NIH (grants R21-DA024260 and P50-DA10075); National Natural Science Foundation of China (grant 11028103). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NHLBI, NIDA, or the NIH.

References

1
Arbeitman
MN
Furlong
EM
Imam
F
et al.
,
Gene expression during the life cycle of Drosophila melanogaster
Science
,
2002
, vol.
297
(pg.
2270
-
5
)
2
Rustici
G
Mata
J
Kivinen
K
et al.
,
Periodic gene expression program of the fission yeast cell cycle
Nat Genet
,
2004
, vol.
36
(pg.
809
-
17
)
3
Cohen
AA
Geva-Zatorsky
N
Eden
E
et al.
,
Dynamic proteomics of individual cancer cells in response to a drug
Science
,
2008
, vol.
322
(pg.
1511
-
6
)
4
Holter
NS
Maritan
A
Cieplak
M
,
Dynamic modeling of gene expression data
Proc Natl Acad Sci USA
,
2001
, vol.
98
(pg.
1693
-
8
)
5
Zhao
LP
Prentice
R
Breeden
L
,
(Statistical modeling of large microarray data sets to identify stimulus-response profiles
Proc Natl Acad Sci USA
,
2001
, vol.
98
(pg.
5631
-
6
)
6
Ramoni
MF
Sebastiani
P
Kohane
IS
,
Cluster analysis of gene expression dynamics
Proc Natl Acad Sci USA
,
2002
, vol.
99
(pg.
9121
-
6
)
7
Park
T
Yi
SG
Lee
S
et al.
,
Statistical tests for identifying differentially expressed genes in time-course microarray experiments
Bioinformatics
,
2003
, vol.
19
(pg.
694
-
703
)
8
Bar-Joseph
Z
Gerber
GK
Gifford
DK
et al.
,
Continuous representations of time-series gene expression data
J Comput Biol
,
2003
, vol.
10
(pg.
341
-
56
)
9
Luan
Y
Li
H
,
Clustering of time-course gene expression data using a mixed-effects model with B-splines
Bioinformatics
,
2003
, vol.
19
(pg.
474
-
82
)
10
Ernst
J
Nau
GJ
Bar-Joseph
Z
,
Clustering short time series gene expression data
Bioinformatics
,
2005
, vol.
21
(pg.
i159
-
68
)
11
Ma
P
Castillo-Davis
CI
Zhong
W
et al.
,
A data-driven clustering method for time course gene expression data
Nucleic Acids Res
,
2006
, vol.
34
(pg.
1261
-
9
)
12
Inoue
LY
Neira
M
Nelson
C
et al.
,
Cluster-based network model for time-course gene expression data
Biostatistics
,
2007
, vol.
8
(pg.
507
-
25
)
13
Muller
HG
Chiou
JM
Leng
X
,
Inferring gene expression dynamics via functional regression analysis
BMC Bioinformatics
,
2008
, vol.
9
pg.
60
14
Kim
B-R
Zhang
L
Berg
A
et al.
,
A computational approach to the functional clustering of periodic gene expression profiles
Genetics
,
2008
, vol.
180
(pg.
821
-
34
)
15
Kim
B-R
McMurry
T
Zhao
W
et al.
,
Wavelet-based functional clustering for high-dimensional dynamic gene expression patterns
J Comput Biol
,
2010
, vol.
17
(pg.
1067
-
80
)
16
Smith
EN
Kruglyak
L
,
Gene-environment interaction in yeast gene expression
PLoS Biol
,
2008
, vol.
6
pg.
e83
17
McAdams
HH
Srinivasan
B
Arkin
AP
,
The evolution of genetic regulatory systems in bacteria
Nat Rev Genet
,
2004
, vol.
5
(pg.
169
-
78
)
18
Seshasayee
A
Bertone
P
Fraser
G
et al.
,
Transcriptional regulatory networks in bacteria: from input signals to output responses
Curr Opin Microbiol
,
2006
, vol.
9
(pg.
511
-
9
)
19
Wittkopp
P
,
Variable gene expression in eukaryotes: a network perspective
J Exp Biol
,
2007
, vol.
210
(pg.
1567
-
75
)
20
Zhao
W
Zhu
J
Gallo-Meagher
M
et al.
,
A unified statistical model for functional mapping of genotype environment interactions for ontogenetic development
Genetics
,
2004
, vol.
168
(pg.
1751
-
62
)
21
Zhao
W
Ma
C-X
Cheverud
JM
et al.
,
A unifying statistical model for QTL mapping of genotype-sex interaction for developmental trajectories
Physiol Genom
,
2004
, vol.
19
(pg.
218
-
27
)
22
Landry
CR
Oh
J
Hartl
DL
et al.
,
Genome-wide scan reveals that genetic variation for transcriptional plasticity in yeast is biased towards multi-copy and dispensable genes
Gene
,
2006
, vol.
366
(pg.
343
-
51
)
23
Li
Y
Alvarez
OA
Gutteling
EW
et al.
,
Mapping determinants of gene expression plasticity by genetical genomics in C elegans
PLoS Genet
,
2006
, vol.
2
pg.
e222
24
DeLichtenberg
U
Jensen
LJ
Brunak
S
et al.
,
Dynamic complex formation during the yeast cell cycle
Science
,
2005
, vol.
307
(pg.
724
-
7
)
25
Jiang
Z
Wu
L
Miller
BL
et al.
,
A novel vein graft model: adaptation to differential flow environments
Am J Physiol Heart Circ Physiol
,
2004
, vol.
286
(pg.
H240
-
5
)
26
Fernandez
CM
Goldman
DR
Jiang
Z
et al.
,
Impact of shear stress on early vein graft remodeling: a biomechanical analysis
Ann Biomed Eng
,
2004
, vol.
32
(pg.
1484
-
93
)
27
Spellman
PT
Sherlock
G
Zhang
MQ
et al.
,
Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization
Mol Biol Cell
,
1998
, vol.
9
(pg.
3273
-
97
)
28
Whitfield
ML
Sherlock
G
Saldanha
AJ
et al.
,
Identification of genes periodically expressed in the human cell cycle and their expression in tumors
Mol Biol Cell
,
2002
, vol.
13
(pg.
1977
-
2000
)
29
Shedden
K
Cooper
S
,
Analysis of cell-cycle gene expression in Saccharomyces cerevisiae using microarrays and multiple synchronization methods
Nucleic Acids Res
,
2002
, vol.
30
(pg.
2920
-
9
)
30
Breeden
LL
,
Periodic transcription: a cycle within a cycle
Curr Biol
,
2003
, vol.
13
(pg.
R31
-
8
)
31
Rustici
G
Mata
J
Kivinen
K
et al.
,
Periodic gene expression program of the fission yeast cell cycle
Nat Genet
,
2004
, vol.
36
(pg.
809
-
17
)
32
Ahdesmaki
M
Lhdesmaki
H
Pearson
R
et al.
,
Robust detection of periodic time series measured from biological systems
BMC Bioinformatics
,
2005
, vol.
6
pg.
117
33
von Bertalanffy
L
,
Quantitative laws for metabolism and growth
Quart Rev Biol
,
1957
, vol.
32
(pg.
217
-
31
)
34
Richards
FJ
,
A flexible growth function for empirical use
J Exp Bot
,
1959
, vol.
10
(pg.
290
-
300
)
35
West
GB
Brown
JH
Enquist
BJ
,
A general model for ontogenetic growth
Nature
,
2001
, vol.
413
(pg.
628
-
31
)
36
Guiot
C
Degiorgis
PG
Delsanto
PP
et al.
,
Does tumor growth follow a ‘universal law’?
Theor Biol
,
2003
, vol.
225
(pg.
147
-
51
)
37
Guiot
C
Delsanto
PP
Carpinteri
A
et al.
,
The dynamic evolution of the power exponent in a universal growth model of tumors
J Theor Biol
,
2006
, vol.
240
(pg.
459
-
63
)
38
Li
N
Das
K
Wu
RL
,
Functional mapping of human growth trajectories
J Theor Biol
,
2010
, vol.
261
(pg.
33
-
42
)
39
Perelson
AS
Neumann
AU
Markowitz
M
et al.
,
HIV-1 dynamics in vivo: Virion clearance rate, infected cell life-span, and viral generation time
Science
,
1996
, vol.
271
(pg.
1582
-
6
)
40
Ahn
K
Luo jt, Berg
A
et al.
,
Functional mapping of drug response with pharmacodynamic-pharmcokinetic principles
Trend Pharmacol Sci
,
2010
, vol.
31
(pg.
306
-
11
)
41
Kingsolver
JG
Woods
HA
,
Thermal sensitivity of growth and feeding in Manduca sexta caterpillars
Physiol Zool
,
1997
, vol.
70
(pg.
631
-
8
)
42
Tobalske
BW
Hedrick
TL
Dial
KP
et al.
,
Comparative power curves in bird flight
Nature
,
2003
, vol.
421
(pg.
363
-
6
)
43
Lin
M
Zhao
W
Wu
RL
,
A statistical framework for genetic association studies of power curves in bird flight
Biol Proced Online
,
2006
, vol.
8
(pg.
164
-
74
)
44
Wu
JS
Zhu
J
Zeng
YR
et al.
,
Functional mapping of norm reactions to environmental signals
Genet Res
,
2007
, vol.
89
(pg.
27
-
38
)
45
Daub
CO
Steuer
R
Selbig
J
et al.
,
Estimating mutual information using B-spline functions-an improved similarity measure for analysing gene expression data
BMC Bioinformatics
,
2004
, vol.
5
pg.
118
46
Borgwardt
KM
Vishwanathan
SV
Kriegel
HP
,
Class prediction from time series gene expression profiles using dynamical systems kernels
Pac Symp Biocomput
,
2006
, vol.
11
(pg.
547
-
58
)
47
Silverman
B
,
Spline smoothing: the equivalent variable kernel method
Ann Stat
,
1984
, vol.
12
(pg.
898
-
916
)
48
Lin
M
Wu
RL
,
A joint model for nonparametric functional mapping of longitudinal trajectories and time-to-events
BMC Bioinformatics
,
2006
, vol.
7
pg.
138
49
Cui
YH
Zhu
J
Wu
RL
,
Functional mapping for genetic control of programmed cell death
Physiol Genom
,
2006
, vol.
25
(pg.
458
-
69
)
50
Cui
YH
Wu
RL
Casella
G
,
Nonparametric functional mapping quantitative trait loci underlying programmed cell death
Stat Appl Genet Mol Biol
,
2008
, vol.
7
pg.
Article 4
51
Yang
RQ
Gao
HJ
Wang
X
et al.
,
A semiparametric model for composite functional mapping of dynamic quantitative traits
Genetics
,
2007
, vol.
177
(pg.
1859
-
70
)
52
Huskova
M
Sen
OK
,
On sequentially adaptive asymptotically efficient rank statistics
Sequent Analy
,
1985
, vol.
4
(pg.
125
-
51
)
53
McKay
MD
,
Non-parametric variance-based methods for assessing uncertainty importance
Reliability Engin Syst Safety
,
1997
, vol.
57
(pg.
267
-
79
)
54
Meyer
K
,
Random regressions to model phenotypic variation in monthly weights of Australian beef cows
Livestock Prod Sci
,
2000
, vol.
65
(pg.
19
-
38
)
55
Diggle
PJ
Heagerty
P
Liang
KY
et al.
Analysis of Longitudinal Data
,
2002
Oxford
Oxford University Press
56
Box
G
Jenkins
G
Reinsel
G
Time Series Analysis: Forecasting and Control
,
1976
San-Francisco
Holden-day
57
Haddad
J
,
On the closed form of the covariance matrix and its inverse of the causal ARMA process
J Time Series Anal
,
2004
, vol.
25
(pg.
443
-
8
)
58
Brockwell
P
Davis
R
Time Series: Theory and Methods
,
1991
New York
Springer
59
Zimmerman
DL
Nunez-Anton
V
,
Parametric modeling of growth curve data: an overview (with discussion)
Test
,
2001
, vol.
10
(pg.
1
-
73
)
60
Zhao
W
HouW, Littell
RC
et al.
,
Structured antedependence models for functional mapping of multivariate longitudinal quantitative traits
Stat Appl Mol Genet Biol
,
2005
, vol.
4
pg.
Article 33
61
Fan
J
Yao
Q
Nonlinear Time Series: Nonparametric and Parametric Methods
,
2003
New York
Springer
62
Cressie
N
Huang
H-C
,
Classes of nonseparable, spatio-temporal stationary covariance functions
J Am Stat Assoc
,
1999
, vol.
94
(pg.
1330
-
40
)
63
Gneiting
T
,
Nonseparable, stationary covariance functions for space-time data
J Am Stat Assoc
,
2002
, vol.
97
(pg.
590
-
600
)
64
Gneiting
T
Genton
M
Guttorp
P
Finkenstadt
B
Held
L
Isham
V
,
Geostatistical space-time models, stationary, separability and full symmetry
Statistical Methods for Spatio-temporal Systems, (Monographs on Statistics and Applied Probability)
,
2006
Boca Raton, Florida
Chapman and Hall/CRC
65
Yap
JS
Li
Y
Das
K
et al.
,
Functional mapping of reaction norms to multiple environmental signals through nonparametric covariance estimation
BMC Plant Biol
,
2011
, vol.
11
pg.
23

Supplementary data