How to cluster gene expression dynamics in response to environmental signals

Wang, Yaqun; Xu, Meng; Wang, Zhong; Tao, Ming; Zhu, Junjia; Wang, Li; Li, Runze; Berceli, Scott A.; Wu, Rongling

doi:10.1093/bib/bbr032

Abstract

Organisms usually cope with change in the environment by altering the dynamic trajectory of gene expression to adjust the complement of active proteins. The identification of particular sets of genes whose expression is adaptive in response to environmental changes helps to understand the mechanistic base of gene–environment interactions essential for organismic development. We describe a computational framework for clustering the dynamics of gene expression in distinct environments through Gaussian mixture fitting to the expression data measured at a set of discrete time points. We outline a number of quantitative testable hypotheses about the patterns of dynamic gene expression in changing environments and gene–environment interactions causing developmental differentiation. The future directions of gene clustering in terms of incorporations of the latest biological discoveries and statistical innovations are discussed. We provide a set of computational tools that are applicable to modeling and analysis of dynamic gene expression data measured in multiple environments.

dynamic gene expression, functional clustering, gene–environment interaction, mixture model

INTRODUCTION

The development of high-throughput technologies, such as DNA microarrays and proteomics platforms, has made it possible to ask and address many fundamental but difficult questions in developmental biology and biomedicine. For example, variation in the pattern of gene expression may reflect unique physiological or pathological properties of individual cells, organs or organisms that cannot be observed readily or directly [1, 2]. By screening approximately 1000 proteins in individual cancer cells, Cohen et al. [3] detected a subset of proteins whose expression displays different dynamic patterns between seemingly identical cancer cells that actually have different fates. To identify distinct patterns of gene expression dynamics from a flood of microarray data, powerful computational tools for clustering genes or proteins based on their dynamic profiles have become essential. In the past decade, enormous efforts have been made to develop computational methods for cataloguing gene expression dynamics and use these distinct patterns to assess developmental functions and mechanisms of biological phenomena [4–15]. There is also a pressing need for computational approaches to cluster analysis of dynamic gene expression that interacts with the environment, because the activation and expression of many genes is environment-contingent [16]. However, to our best knowledge, such approaches have not been systematically described in the literature, obstructing geneticists to analyze time series data in a way that can detect complex effects on expression patterns.

In biology, there is a widespread phenomenon that organisms cope with biotic and abiotic environments by controlling gene expression to harness the complement of active proteins [17–19]. When the environment alternates between discrete states, organisms will stimulate their regulatory system through adjusting gene transcription rates to best adapt to the environment. For this reason, by detecting the difference in the pattern of gene expression trajectories between discrete environments, we will be in a better position to study interactions between genes and environments and dictate a comprehensive map of gene–environment relationships. Traditional approaches for studying gene–environment interactions are based on quantitative trait locus (QTL) mapping usually with experimental crosses. Significant gene–environment interactions are identified if specific QTLs are detected, through statistical tests, to display different effects between environments [20, 21]. More recently, gene transcript abundance has been used to study gene–environment interactions in many organisms such as yeast [16, 22] and worms [23], but all these studies are limited to gene expression data measured at single time points of development.

The purpose of this article is to describe a general framework for identifying environment-specific clusters of gene expression. The use of gene expression dynamics to understand gene–environment interactions is highly informative because of its capacity to identify development-related genes. However, this is a challenging work in terms of gene clustering across two-dimensional spaces, time and environmental state. The framework described in this article integrates developmental and environment-dependent programs of gene expression. Mathematical aspects of gene expression dynamics are implemented into a mixture model setting by considering the impact of environment on gene expression. The patterns of gene expression related to specific physiological functions can be parsimoniously modeled using a set of mathematical parameters [14, 24]. Thus, by estimating the parameters that determine mathematical functions, the pattern of how genes change their level of expression over time and environment can be estimated and tested. The results from these models, therefore, can better be interpreted in a biologically sensible way. In addition, the framework considers the intrinsic structure of time-dependent correlations based on an optimal statistical process, which increases the power of detecting significantly differentiated patterns. To demonstrate its usefulness and utilization in practice, we use this framework to analyze a real data set from a rabbit hemodynamic study in which gene expression is observed in two distinct blood flow environments [25, 26]. We evaluated the advantages of this tool by performing simulation studies. The simulation results illustrated that the tool has favorable statistical properties and can be used in any environment-dependent gene expression data.

GENE-CLUSTERING FRAMEWORK

Statistical model

Suppose there are n genes each measured at T time points in L environments. Let

denote the gene expression data for gene i in environment l. Combining all the environments, we have

⁠. If these genes are grouped into J clusters, it means that any one of genes (i ) is assumed to arise from one (and only one) of the J possible clusters. Thus, the phenotypic value of gene i expressed at time

in environment l is written as

(1)

where

is an indicator for gene i, defined as 1 if this gene belongs to cluster j and 0 otherwise,

is the mean of all genes belonging to cluster j at at time

in environment l,

is the value of covariate c (⁠

⁠) for gene i,

is the effect of covariate c, and

is the residual assumed to follow a Gaussian distribution with mean zero and variance

⁠. For longitudinal data, residual errors at different time points may be correlated with covariance

(⁠

At least one of the inequalities,

holds). The residual variances and covariance comprise a (⁠

⁠) covariance matrix

⁠.

The distribution of gene expression data is expressed as the J-component mixture probability density function, i.e.

(2)

where

is a vector of mixture proportions which are nonnegative and sum to unity;

contains the mean vector of cluster j; and

contains residual variances and covariances among T time points over L environments which are common for all clusters. The probability density function of cluster j,

⁠, is assumed to be multivariate normally distributed with TL-dimensional mean vector

(3)

and covariance matrix

⁠. Notice that

contains gene-specific covariate effects.

The likelihood based on a mixture model containing J clusters can be written as

(4)

where

is a vector of unknown parameters including the mixture proportions, cluster-specific mean vectors and covariance.

Different from traditional treatments, we will incorporate mathematical and statistical models to fit the mean-covariance structures. Instead of estimating all elements in the vectors and covariance, we estimate the mathematical and statistical parameters that model the mean-covariance structures. Thus, the question of clustering gene expression dynamics becomes how to find a set of parameters (arrayed in ⁠) that models cluster-specific expression profiles in a biologically and statistically meaningful way and to find a set of parameters (arrayed in ⁠) that models the covariance structure both parsimoniously and flexibly.

Structural modeling of mean vectors

Since the transcript levels of DNA microarrays generally vary in a time course, we may use mathematical and statistical models to approach their dynamic changes. Below is a list of approaches for modeling time-dependent gene expression profiles:

Parametric modeling

For clock gene cases, the amount of mRNAs within the cell division cycle changes periodically, coincident with the cell cycle, which helps to maintain proper order during cell division or to conserve limited resources. The oscillation of cell cycle-regulated genes can be mathematically described by periodic Fourier functions or other functions [27–32]. The Fourier function can be approximated by its first K terms, expressed as

The coefficients

and

determine the times at which the expression level achieves maximums and minimums,

is the average expression level of the gene and

specifies the periodicity of the regulation. From Equation (3), the mean expression value of gene cluster j at time

in environment l is expressed as

where

denotes the vector of Fourier parameters of the first K orders. Thus, by estimating the parameters that define the periodic curves for individual clusters, we can determine the differences in the temporal pattern of gene expression [14].

There are many other biologically well-justified curves that can be used to model gene expression dynamics. These include sigmoid equations for gene expression related to biological growth [33–37], triple-logistic equations for gene expression related to human body growth [38], bi-exponential equations for gene expression related to HIV dynamics [39], sigmoid Emax models for gene expression related to pharmacodynamic response [40], biological thermal dynamics [41], aerodynamic power curves for gene expression related to bird flight [42, 43], hyperbolic curves for gene expression related to photosynthetic reaction [44], etc.

Nonparametric modeling

If time-varying expression of genes does not obey an explicit mathematical function, nonparametric approaches, such as kernel estimators or B-splines, can be used [9, 45, 46]. Kernel estimators are based on local polynomial regression, whereas smoothing splines use a piece-wise polynomial function. As shown by Silverman [47], kernel smoothing and smoothing-spline smoothing are asymptotically equivalent for independent data and splines are higher-order kernels. More recently, Legendre orthogonal polynomials (LOP) have been used to model dynamic changes of complex traits that cannot be explicitly described by a mathematical equation [48–51]. Since the LOP are orthogonal to each other and integrate to 0 in the interval [−1, 1], nonparametric estimates derived from this approach display favorable asymptotic properties [52, 53]. The LOP have been successfully used to model time-varying phenotypic or genetic changes for many complex traits, such as milk production [54] and plant height growth [48, 49]. As will be seen from an example below, it should be equally useful for modeling the dynamic pattern of gene expression profiles in a time course.

Semiparametric modeling

If gene expression spans multiple distinct stages (see [13]), at some of which the expression values follow a parametric form but at others of which they do not, we can implement a semiparametric model that combines the parsimony and biological relevance of parametric approaches with the flexibility of nonparametric approaches. As explained by Cui et al. [49], such a semiparametric approach was used to model the growth process and death process of tiller number in a lifetime of rice. A similar semiparametric approach can also be implemented in the clustering framework of multistage gene expression dynamics. This will enable us to study dynamic changes of gene expression by relating its temporal profiles from different developmental stages.

Structural modeling of covariance

Unstructured estimates of a longitudinal covariance matrix may be highly unstable for large matrices. This, in conjunction with the fact that the covariance among repeated measures over time has an inherent structure [55], implies that structuring a covariance matrix with few parameters may be crucial for parsimonious and efficient parameter estimates in dynamic gene clustering. An extreme of covariance structuring is compound symmetry and autoregression of order one, but this may be far from the true covariance, leading to severe bias. The best covariance estimator should be at the balance between its variance and bias. Below, we list several commonly used approaches for covariance structure.

Autoregressive moving-average process(p,q)

The autoregressive moving-average process, ARMA( p, q) [56], is flexible to provide a robust estimate of gene expression covariance structure [38]. The zero-mean residual error

in environment l (1) is generated according to the following process

(5)

where

and

are unknown parameters and

is a sequence of independent and identically distributed normal random variables with zero mean and variance

⁠. The ARMA(p,q) model parameters are arrayed in

⁠. The merit of the ARMA model includes the existence of closed forms for the estimates of the inverse and determinant of the structured covariance matrix [57, 58], which enhances computational efficiency.

By various constraints, the ARMA model can be reduced to a simple autoregressive (AR) model and structured antedependence (SAD) model [59]. Both the first-order AR and first-order SAD models use only two parameters, but the latter is more flexible than the former since the latter allows the variance and correlation to change over time. The SAD model has been successfully incorporated in functional mapping of dynamic complex traits [60]. For the ARMA model, it is important to determine its optimal order to model covariance structure. A model selection procedure based on penalized likelihood criteria, such as AIC and BIC, can be established to determine an optimal approach.

Kernel smoothing

The kernel smoothing method has been used to estimate longitudinal covariances [61]. The advantage of this method lies in its flexibility to specify any form of covariances and asymptotic properties. Under the homogeneous assumption, the covariance of gene expression between any two time points

and

in environment l is written as a function of time interval, i.e. Cov

⁠. Kernel smoothing describes this covariance by

(6)

where n is the total number of genes, T is the total number of time points,

is a kernel function, h is a bandwidth and

is the mean of gene i, given by

⁠. One of the mostly used kernel functions is Gaussian kernel, i.e.

⁠. A variety of statistical methods have been developed to choose an optimal kernel and appropriate bandwidth (see [61]).

Modeling covariance over time and environment

In this article, we consider gene expression dynamics in multiple environments. The approaches described above are used to model time-dependent covariances of gene expression separately for each environment. These approaches can be similarly used to model covariances over different environments at each time point. The covariance structure over time and environment is then modeled by taking the product of purely temporal and environmental covariances. This so-called separable approach is simple but has many undesirable properties since it does not allow environment–time interactions. We can implement a nonseparable stationary model (see [62–64]) to structure time-environment covariance of gene expression. A nonseparable covariance is not expressed as a Kronecker product of two matrices like separable structures. An advantage of covariance modeling in this context is in providing a better characterization of the random process to obtain optimal kriging or prediction of unobserved portions of it. In functional mapping of dynamic complex traits, Yap et al. [65] has successfully incorporated Cressie and Huang's [62] nonseparable model to estimate the covariance of photosynthetic rate over temperature and irradiance. A similar idea can be developed to use this nonseparable model to fit the spatiotemporal covariance of gene expression.

Estimation and tests

A hybrid EM-simplex algorithm was implemented to estimate the parameters, ⁠, contained in the likelihood (4). The EM algorithm provides a platform for estimating the proportions of different gene clusters, within which the simplex algorithm is embedded to estimate base vectors for each cluster and the covariance-structuring parameters. This can be described as follows:

In the E step, we define and estimate the posterior probabilities of gene i, with which it belongs to a particular expression pattern j, by

(7)

In the M step, the proportion of expression pattern j is calculated by

(8)

Mean-covariance structuring parameters in

and

are estimated in this step, but no closed forms can be derived for their estimators. The EM algorithm was derived to estimate these parameters when gene expression dynamics is modeled by Fourier series approximation and the covariance modeled by the ARMA process [14, 38]. In general, the simplex algorithm, which does not depend on explicit equations, can be implemented to estimate these structuring parameters.

The framework for clustering gene expression dynamics over multiple environments allows geneticists to address many biologically meaningful questions. First, an optimal number of gene clusters in terms of their different expression dynamics over all environments can be determined using AIC or BIC approaches (see the example shown below). Second, we need to determine the optimal number of gene clusters in a specific environment. For two given patterns j and j', they may be identical in an environment l, although different for all the L environments. This can be tested by

(9)

The H₀ states that gene expression over an entire time course is identical for the two gene clusters j and j ′. If the H₁ is rejected, it means that the optimal number of clusters in environment l is equal to, or less than,

⁠. By performing this pairwise test for all gene clusters, this approach allows the identification of the optimal number of expression patterns for environment l.

Third, we can test the significance of gene–environment interactions. This can be done by testing

(10)

The H₀ states that the gene expression of a cluster over an entire time course is identical between the two environments l and l ′. If the H₀ is rejected, it means that the expression of gene cluster j displays significant gene–environment interactions. This test provides a quantitative way to study the relationship between genes and environments.

Fourth, we are interested in testing how genes interact with time in a specific environment. For two given clusters j and j′, this test is expressed as

(11)

where c is a constant. The H₀ states that time-dependent gene expression of cluster j ′ can be expressed as a linear function of that of cluster j in environment l. If the H₀ is rejected, it means that this pair of gene clusters has a significant gene–time interaction under this particular environment.

WORKED EXAMPLE

Data analysis

The new tool is demonstrated by analyzing a data set of microarray genes associated with response to vein bypass grafting designed to treat arterial occlusive disease. The data were obtained using a rabbit bilateral vein graft construct, as previously described by Jiang et al. [25]. New Zealand White rabbits (weighing 3.0–3.5 kg) were treated by bilateral jugular vein interposition grafting and unilateral distal carotid artery branch ligation to create two distinct flows, i.e. two different environments. Through ligation of the internal carotid and three of the four primary branches of the external carotid artery, an immediate 6-fold difference in blood flow between the right and left vein grafts was obtained. A segment of the vein was retained at the time of implantation for baseline morphometric measurements. Vein grafts were harvested at 1, 3, 7, 14, 28, 90 and 180 days after implantation. Expression of 14 958 genes was recorded for each of these time points under both treatments, high flow and low flow. Other parameters related to hemodynamic behavior, such as graft flow rate, intraluminal pressure, mean circumferential wall stress and shear stress, were also measured or estimated.

An initial step is the selection of an appropriate model that fits the dynamic change of gene expression over time. Figure 1 shows the plotting of 10 randomly selected genes expressed over time in the two treatments, from which we found that there is a great variability in gene expression trajectories. Some are curvaceous, while others are quite flat. Thus, we used a flexible nonparametric approach based on LOP to model gene expression dynamics. Let

denote a family of LOP with a particular order r derived from a special differential equation, where

is a scaled time with a range

⁠. Let

denote a vector of base values for cluster j in environment l. Then, time-varying mean values for cluster j in environment l in Equation (2) can be expressed as a linear combination of

weighted by the family of LOP, i.e.

Our task now is to estimate the base vector

from the given data. The variance of expression among genes seems to be broadly consistent over time points, suggesting that the first-order AR model may fit the data. To combine the expression data from the two flows, we used a separable model to structure the covariance over time and environment. In Figure 2, a plot of BIC values is illustrated against varying numbers of gene clusters under different LOP orders, from which we chose eight clusters and three orders that provide a best combination for curve fitting. Implementing this combination, we estimated the expression trajectories of each of the eight gene clusters for both high and low flows. We need to detect if any of these eight clusters, labeled from A to H (see Supplementary Figure S1), overlap in a flow. Pairwise tests using hypothesis test (9) indicate that no pair of clusters is identical for expression trajectories in each flow (P < 0.0001). This result suggests that the optimal number of clusters should be eight for both flows. Table 1 gives the estimated proportions and their standard errors of each cluster from these genes.

Figure 1:

Trajectories of expression for ten genes randomly chosen from those associated with response to vein bypass grafting in rabbits in two treatments, high flow and low flow.

Open in new tab Download slide

Figure 2:

Plot of BIC values calculated for expression trajectories of different gene clusters over cluster number and LOP order.

Open in new tab Download slide

Table 1:

Open in new tab

Estimated proportions of gene clusters and standard errors (in parentheses) estimated by resampling for 14 958 genes associated with response to vein bypass grafting in rabbits under two different treatments, high flow and low flow. The significance of gene–environment interactions for each cluster is also given

Cluster	A	B	C	D	E	F	G	H
Proportion	0.0116 (0.0019)	0.0123 (0.0068)	0.3354 (0.0056)	0.3831 (0.0075)	0.1134 (0.0062)	0.0359 (0.0033)	0.0100 (0.0012)	0.0083 (0.0010)
Gene–environment interaction test
P-value	<0.0001	>0.100	>0.250	<0.0001	>0.400	<0.0001	<0.001	<0.0001

Cluster	A	B	C	D	E	F	G	H
Proportion	0.0116 (0.0019)	0.0123 (0.0068)	0.3354 (0.0056)	0.3831 (0.0075)	0.1134 (0.0062)	0.0359 (0.0033)	0.0100 (0.0012)	0.0083 (0.0010)
Gene–environment interaction test
P-value	<0.0001	>0.100	>0.250	<0.0001	>0.400	<0.0001	<0.001	<0.0001

Table 1:

Open in new tab

Estimated proportions of gene clusters and standard errors (in parentheses) estimated by resampling for 14 958 genes associated with response to vein bypass grafting in rabbits under two different treatments, high flow and low flow. The significance of gene–environment interactions for each cluster is also given

Cluster	A	B	C	D	E	F	G	H
Proportion	0.0116 (0.0019)	0.0123 (0.0068)	0.3354 (0.0056)	0.3831 (0.0075)	0.1134 (0.0062)	0.0359 (0.0033)	0.0100 (0.0012)	0.0083 (0.0010)
Gene–environment interaction test
P-value	<0.0001	>0.100	>0.250	<0.0001	>0.400	<0.0001	<0.001	<0.0001

Cluster	A	B	C	D	E	F	G	H
Proportion	0.0116 (0.0019)	0.0123 (0.0068)	0.3354 (0.0056)	0.3831 (0.0075)	0.1134 (0.0062)	0.0359 (0.0033)	0.0100 (0.0012)	0.0083 (0.0010)
Gene–environment interaction test
P-value	<0.0001	>0.100	>0.250	<0.0001	>0.400	<0.0001	<0.001	<0.0001

By calculating the posterior probabilities of each gene that belongs to different clusters using Equation (7), we can determine the most likely cluster of this gene. Thus, we can draw gene expression trajectories for all genes that belong to a particular cluster A–H, separately for the two flows and the mean trajectory of the cluster for each flow using the estimates of curve parameters (Figure 3). The 95% confidence intervals of each estimated trajectory are generally within the variation of temporal gene expression profiles among individual genes, suggesting that our estimates are reasonably accurate. In general, most individual gene expression trajectories display similar time-varying trends between the two flows, but marked discrepancies in expression trajectories were detected for some particular clusters.

Figure 3:

Mean expression trajectories (thick purple curve) of each of the eight gene clusters, labeled (A)–(H), over time under high flow and low flow. The two broken curves in each plot are the 95% confidence intervals of each estimated mean trajectory. Expression trajectories of all those genes that belong to a particular cluster, determined by the posterior probabilities, are shown in light blue.

Open in new tab Download slide

Hypothesis test (10) allows the detection of gene–environment interactions in expression profiles over high and low flows. Table 1 gives the results of significance tests for gene–environment interactions. Except for clusters B, C and E, all other clusters are expressed differently between high and low flows. To better show treatment-dependent expression differences, we draw the mean trajectories of each cluster from high and low flows in the same plot (see Supplementary Figure S2). The expression of cluster A has the highest level right after implantation, decreases drastically within 25–35 days and then increases gradually. This cluster displays a more pronounced change of expression over time in high flow than low flow. Cluster B has a similar trend of gene expression profiles although its time-varying change is milder compared to cluster A. It appears that clusters C and D have a minimum level of expression throughout experimental time. Clusters E, F and H are expressed at a low level in the beginning of treatment, reach a peak at Day 50 after implantation. Cluster H changes its expression level over time most abruptly, followed by clusters F and E. The expression of cluster G increases over time monotonously until Days 100–120, after which its expression decreases although it is more striking in low flow than high flow.

Simulation

We performed simulation studies to examine the statistical properties of the clustering model. The expression data were simulated by mimicking the structure of the real data analyzed above. A total of 4800 genes were assumed to include eight different clusters in two environments specified by mean trajectories as shown in Figure 3. These genes each have an expression trajectory over four time points as the sum of the mean trajectory of the underlying cluster and residual errors whose covariance structure follows the first-order AR model, but assuming the value of variance that triples the estimated variance. The proportions of eight clusters in Table 1 were used to simulate gene expression profiles.

Simulated data were analyzed by the model. Based on BIC values, the model can detect the correct number of clusters and the correct order of LOP. The model estimates the proportions of clusters A–H precisely. Also, expression trajectories of each gene cluster can be reasonably well estimated (Figure 4), despite a tripled variance used. This shows that the results from the real data set analyzed by the model are convincing from a statistical point of view.

Figure 4:

Comparison of estimated (solid) and true (dot) mean expression trajectories for each of the eight clusters, labeled (A)–(H), under two hypothesized treatments, high flow and low flow. The simulated data mimicked the structure of the rabbit data, but assuming an error variance that triples the estimated variance. The 95% confidence intervals (broken curves) of each estimated mean trajectory are given. Expression trajectories of all those genes that belong to a particular cluster are shown in light blue.

Open in new tab Download slide

An additional simulation was conducted to test the power of the model to detect gene–environment interactions and its false positive rates (FPR). Consider clusters A, D, F, G and H obtained from rabbit gene expression data which display a certain level of gene–environment interactions. By repeating the simulation and estimation procedure 100 times, we detected the numbers of cases in which hypothesis tests for gene–environment interactions using test (10) are significant, are 85–97 for clusters A, D, F, G and H, respectively. To test the FPR, we used the same parameters for clusters B, C and E to simulate gene expression data for high and low flows. Of 100 simulation replicates, less than 5 were detected to be significant. This suggests that the FPR of our tool is acceptably low.

DISCUSSION

There is a pressing need for computational tools that can unravel the developmental machinery of time-dependent gene expression profiles, despite a vast body of literature presenting these tools [4–6, 8, 10–12]. One significant lack is the unavailability of models for analyzing gene–environment interactions for gene expression dynamics. A few pioneering studies have demonstrated the capacity of gene transcript abundance to comprehend the genetic architecture of gene–environment interactions [16, 22, 23].

In this article, we describe a computational tool that can test gene–environment interactions on a genomic level using dynamic gene expression data. The developmental dynamics of cells, organs or organisms are related with many fundamental phenomena in biology, such as growth and phenotypic plasticity. Our understanding of how developmental dynamics is regulated through a balance of gene and environment helps to reveal the mechanistic origins of these phenomena. The new tool may provide an important means for analyzing temporal expression data and construct a map of gene–environment interactions. As an example, we used Legendre-based nonparametric fitting to model dynamic changes of gene expression within a mixture-model framework. One advantage of the Legrendre approach lies in its flexibility for curve fitting [52], computational efficiency and avoidance of knot choice essential for B-splines. For gene expression data with explicit curves, such as periodic transcriptional profiles, robust mathematical equations with better biological relevance and better parsimony can be used.

Gene–environment interactions are important to understand biology and can now be tested in a quantitative way by the tool presented. By applying it to a real data set of microarray genes associated with response to vein bypass grafting between high and low flows in rabbits, this tool identified eight distinct patterns of expression trajectories in a time course. Each of these patterns may be related with a particular biological function operational in hemodynamic processes. Although many patterns display a similar trend of time-varying expression between high and low blood flows, many of them are essentially different based on hypothesis tests by our tool. A further molecular study may help identify specific biochemical pathways related to each of these different gene expression patterns. In a recent study, Cohen et al. [3] detected the discrepancy of dynamic trajectories for a few proteins, which corresponds to cell death or survival, between seemingly identical cells. The tool presented should gain more quantitative insights into the distinction of seemingly trivial differences in genomic and proteomic dynamic studies.

While modeling gene expression dynamics jointly over time and environment in the real example, we assume no interactions between time and environment. Although it facilitates our modeling and computing, this assumption may not be realistic in some cases. A general model should consider the dependence of gene expression profiles in different times and environments by nonseparable covariance structuring approaches (see [62–65] for examples). Different approaches for covariance structure based on nonparametric or semiparametric models should be incorporated and compared using penalized likelihood criteria. In our software package for environment-dependent functional clustering, we implement many of these approaches for users to choose the most parsimonious one given their particular data sets. The computer code for the tool developed is available at Penn State Center for Statistical Genetics web site, http://statgen.psu.edu.

SUPPLEMENTARY DATA

Supplementary data are available online at http://bib.oxfordjournals.org/.

Key Points

An organism adapts to the environment where it grows through the alteration of gene expression dynamics.
The identification of dynamic gene expression patterns in response to environmental change can facilitate our understanding of the mechanistic base of gene–environment interactions essential for organismic development.
This article presents a comprehensive computational framework for clustering gene expression dynamics over a range of environments. The framework model is validated through simulation and real data analysis.
Our goal is to provide interested researchers with the computational tools that are applicable to modeling and analysis of dynamic gene expression data derived from multiple environments.

FUNDING

NIH/NHLBI (grant R01 HL095508); NIDA/NIH (grants R21-DA024260 and P50-DA10075); National Natural Science Foundation of China (grant 11028103). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NHLBI, NIDA, or the NIH.

References

1

Arbeitman

MN

Furlong

EM

Imam

F

et al. ,

Gene expression during the life cycle of Drosophila melanogaster

,

Science

,

2002

, vol.

297

(pg.

2270

-

5

)

2

Rustici

G

Mata

J

Kivinen

K

et al. ,

Periodic gene expression program of the fission yeast cell cycle

,

Nat Genet

,

2004

, vol.

36

(pg.

809

-

17

)

3

Cohen

AA

Geva-Zatorsky

N

Eden

E

et al. ,

Dynamic proteomics of individual cancer cells in response to a drug

,

Science

,

2008

, vol.

322

(pg.

1511

-

6

)

4

Holter

NS

Maritan

A

Cieplak

M

,

Dynamic modeling of gene expression data

,

Proc Natl Acad Sci USA

,

2001

, vol.

98

(pg.

1693

-

8

)

5

Zhao

LP

Prentice

R

Breeden

L

,

(Statistical modeling of large microarray data sets to identify stimulus-response profiles

,

Proc Natl Acad Sci USA

,

2001

, vol.

98

(pg.

5631

-

6

)

6

Ramoni

MF

Sebastiani

P

Kohane

IS

,

Cluster analysis of gene expression dynamics

,

Proc Natl Acad Sci USA

,

2002

, vol.

99

(pg.

9121

-

6

)

7

Park

T

Yi

SG

Lee

S

et al. ,

Statistical tests for identifying differentially expressed genes in time-course microarray experiments

,

Bioinformatics

,

2003

, vol.

19

(pg.

694

-

703

)

8

Bar-Joseph

Z

Gerber

GK

Gifford

DK

et al. ,

Continuous representations of time-series gene expression data

,

J Comput Biol

,

2003

, vol.

10

(pg.

341

-

56

)

9

Luan

Y

Li

H

,

Clustering of time-course gene expression data using a mixed-effects model with B-splines

,

Bioinformatics

,

2003

, vol.

19

(pg.

474

-

82

)

10

Ernst

J

Nau

GJ

Bar-Joseph

Z

,

Clustering short time series gene expression data

,

Bioinformatics

,

2005

, vol.

21

(pg.

i159

-

68

)

11

Ma

P

Castillo-Davis

CI

Zhong

W

et al. ,

A data-driven clustering method for time course gene expression data

,

Nucleic Acids Res

,

2006

, vol.

34

(pg.

1261

-

9

)

12

Inoue

LY

Neira

M

Nelson

C

et al. ,

Cluster-based network model for time-course gene expression data

,

Biostatistics

,

2007

, vol.

8

(pg.

507

-

25

)

13

Muller

HG

Chiou

JM

Leng

X

,

Inferring gene expression dynamics via functional regression analysis

,

BMC Bioinformatics

,

2008

, vol.

9

pg.

60

14

Kim

B-R

Zhang

L

Berg

A

et al. ,

A computational approach to the functional clustering of periodic gene expression profiles

,

Genetics

,

2008

, vol.

180

(pg.

821

-

34

)

15

Kim

B-R

McMurry

T

Zhao

W

et al. ,

Wavelet-based functional clustering for high-dimensional dynamic gene expression patterns

,

J Comput Biol

,

2010

, vol.

17

(pg.

1067

-

80

)

16

Smith

EN

Kruglyak

L

,

Gene-environment interaction in yeast gene expression

,

PLoS Biol

,

2008

, vol.

6

pg.

e83

17

McAdams

HH

Srinivasan

B

Arkin

AP

,

The evolution of genetic regulatory systems in bacteria

,

Nat Rev Genet

,

2004

, vol.

5

(pg.

169

-

78

)

18

Seshasayee

A

Bertone

P

Fraser

G

et al. ,

Transcriptional regulatory networks in bacteria: from input signals to output responses

,

Curr Opin Microbiol

,

2006

, vol.

9

(pg.

511

-

9

)

19

Wittkopp

P

,

Variable gene expression in eukaryotes: a network perspective

,

J Exp Biol

,

2007

, vol.

210

(pg.

1567

-

75

)

20

Zhao

W

Zhu

J

Gallo-Meagher

M

et al. ,

A unified statistical model for functional mapping of genotype environment interactions for ontogenetic development

,

Genetics

,

2004

, vol.

168

(pg.

1751

-

62

)

21

Zhao

W

Ma

C-X

Cheverud

JM

et al. ,

A unifying statistical model for QTL mapping of genotype-sex interaction for developmental trajectories

,

Physiol Genom

,

2004

, vol.

19

(pg.

218

-

27

)

Google Scholar

Crossref

WorldCat

22

Landry

CR

Oh

J

Hartl

DL

et al. ,

Genome-wide scan reveals that genetic variation for transcriptional plasticity in yeast is biased towards multi-copy and dispensable genes

,

Gene

,

2006

, vol.

366

(pg.

343

-

51

)

23

Li

Y

Alvarez

OA

Gutteling

EW

et al. ,

Mapping determinants of gene expression plasticity by genetical genomics in C elegans

,

PLoS Genet

,

2006

, vol.

2

pg.

e222

24

DeLichtenberg

U

Jensen

LJ

Brunak

S

et al. ,

Dynamic complex formation during the yeast cell cycle

,

Science

,

2005

, vol.

307

(pg.

724

-

7

)

25

Jiang

Z

Wu

L

Miller

BL

et al. ,

A novel vein graft model: adaptation to differential flow environments

,

Am J Physiol Heart Circ Physiol

,

2004

, vol.

286

(pg.

H240

-

5

)

26

Fernandez

CM

Goldman

DR

Jiang

Z

et al. ,

Impact of shear stress on early vein graft remodeling: a biomechanical analysis

,

Ann Biomed Eng

,

2004

, vol.

32

(pg.

1484

-

93

)

27

Spellman

PT

Sherlock

G

Zhang

MQ

et al. ,

Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization

,

Mol Biol Cell

,

1998

, vol.

9

(pg.

3273

-

97

)

28

Whitfield

ML

Sherlock

G

Saldanha

AJ

et al. ,

Identification of genes periodically expressed in the human cell cycle and their expression in tumors

,

Mol Biol Cell

,

2002

, vol.

13

(pg.

1977

-

2000

)

29

Shedden

K

Cooper

S

,

Analysis of cell-cycle gene expression in Saccharomyces cerevisiae using microarrays and multiple synchronization methods

,

Nucleic Acids Res

,

2002

, vol.

30

(pg.

2920

-

9

)

30

Breeden

LL

,

Periodic transcription: a cycle within a cycle

,

Curr Biol

,

2003

, vol.

13

(pg.

R31

-

8

)

31

Rustici

G

Mata

J

Kivinen

K

et al. ,

Periodic gene expression program of the fission yeast cell cycle

,

Nat Genet

,

2004

, vol.

36

(pg.

809

-

17

)

32

Ahdesmaki

M

Lhdesmaki

H

Pearson

R

et al. ,

Robust detection of periodic time series measured from biological systems

,

BMC Bioinformatics

,

2005

, vol.

6

pg.

117

33

von Bertalanffy

L

,

Quantitative laws for metabolism and growth

,

Quart Rev Biol

,

1957

, vol.

32

(pg.

217

-

31

)

34

Richards

FJ

,

A flexible growth function for empirical use

,

J Exp Bot

,

1959

, vol.

10

(pg.

290

-

300

)

Google Scholar

Crossref

WorldCat

35

West

GB

Brown

JH

Enquist

BJ

,

A general model for ontogenetic growth

,

Nature

,

2001

, vol.

413

(pg.

628

-

31

)

36

Guiot

C

Degiorgis

PG

Delsanto

PP

et al. ,

Does tumor growth follow a ‘universal law’?

,

Theor Biol

,

2003

, vol.

225

(pg.

147

-

51

)

Google Scholar

Crossref

WorldCat

37

Guiot

C

Delsanto

PP

Carpinteri

A

et al. ,

The dynamic evolution of the power exponent in a universal growth model of tumors

,

J Theor Biol

,

2006

, vol.

240

(pg.

459

-

63

)

38

Li

N

Das

K

Wu

RL

,

Functional mapping of human growth trajectories

,

J Theor Biol

,

2010

, vol.

261

(pg.

33

-

42

)

Google Scholar

Crossref

WorldCat

39

Perelson

AS

Neumann

AU

Markowitz

M

et al. ,

HIV-1 dynamics in vivo: Virion clearance rate, infected cell life-span, and viral generation time

,

Science

,

1996

, vol.

271

(pg.

1582

-

6

)

40

Ahn

K

Luo jt, Berg

A

et al. ,

Functional mapping of drug response with pharmacodynamic-pharmcokinetic principles

,

Trend Pharmacol Sci

,

2010

, vol.

31

(pg.

306

-

11

)

Google Scholar

Crossref

WorldCat

41

Kingsolver

JG

Woods

HA

,

Thermal sensitivity of growth and feeding in Manduca sexta caterpillars

,

Physiol Zool

,

1997

, vol.

70

(pg.

631

-

8

)

42

Tobalske

BW

Hedrick

TL

Dial

KP

et al. ,

Comparative power curves in bird flight

,

Nature

,

2003

, vol.

421

(pg.

363

-

6

)

43

Lin

M

Zhao

W

Wu

RL

,

A statistical framework for genetic association studies of power curves in bird flight

,

Biol Proced Online

,

2006

, vol.

8

(pg.

164

-

74

)

44

Wu

JS

Zhu

J

Zeng

YR

et al. ,

Functional mapping of norm reactions to environmental signals

,

Genet Res

,

2007

, vol.

89

(pg.

27

-

38

)

45

Daub

CO

Steuer

R

Selbig

J

et al. ,

Estimating mutual information using B-spline functions-an improved similarity measure for analysing gene expression data

,

BMC Bioinformatics

,

2004

, vol.

5

pg.

118

46

Borgwardt

KM

Vishwanathan

SV

Kriegel

HP

,

Class prediction from time series gene expression profiles using dynamical systems kernels

,

Pac Symp Biocomput

,

2006

, vol.

11

(pg.

547

-

58

)

Google Scholar

OpenURL Placeholder Text

WorldCat

47

Silverman

B

,

Spline smoothing: the equivalent variable kernel method

,

Ann Stat

,

1984

, vol.

12

(pg.

898

-

916

)

Google Scholar

Crossref

WorldCat

48

Lin

M

Wu

RL

,

A joint model for nonparametric functional mapping of longitudinal trajectories and time-to-events

,

BMC Bioinformatics

,

2006

, vol.

7

pg.

138

49

Cui

YH

Zhu

J

Wu

RL

,

Functional mapping for genetic control of programmed cell death

,

Physiol Genom

,

2006

, vol.

25

(pg.

458

-

69

)

Google Scholar

Crossref

WorldCat

50

Cui

YH

Wu

RL

Casella

G

,

Nonparametric functional mapping quantitative trait loci underlying programmed cell death

,

Stat Appl Genet Mol Biol

,

2008

, vol.

7

pg.

Article 4

Google Scholar

OpenURL Placeholder Text

WorldCat

51

Yang

RQ

Gao

HJ

Wang

X

et al. ,

A semiparametric model for composite functional mapping of dynamic quantitative traits

,

Genetics

,

2007

, vol.

177

(pg.

1859

-

70

)

52

Huskova

M

Sen

OK

,

On sequentially adaptive asymptotically efficient rank statistics

,

Sequent Analy

,

1985

, vol.

4

(pg.

125

-

51

)

Google Scholar

Crossref

WorldCat

53

McKay

MD

,

Non-parametric variance-based methods for assessing uncertainty importance

,

Reliability Engin Syst Safety

,

1997

, vol.

57

(pg.

267

-

79

)

Google Scholar

Crossref

WorldCat

54

Meyer

K

,

Random regressions to model phenotypic variation in monthly weights of Australian beef cows

,

Livestock Prod Sci

,

2000

, vol.

65

(pg.

19

-

38

)

Google Scholar

Crossref

WorldCat

55

Diggle

PJ

Heagerty

P

Liang

KY

et al. ,

Analysis of Longitudinal Data

,

2002

Oxford

Oxford University Press

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

56

Box

G

Jenkins

G

Reinsel

G

,

Time Series Analysis: Forecasting and Control

,

1976

San-Francisco

Holden-day

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

57

Haddad

J

,

On the closed form of the covariance matrix and its inverse of the causal ARMA process

,

J Time Series Anal

,

2004

, vol.

25

(pg.

443

-

8

)

Google Scholar

Crossref

WorldCat

58

Brockwell

P

Davis

R

,

Time Series: Theory and Methods

,

1991

New York

Springer

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

59

Zimmerman

DL

Nunez-Anton

V

,

Parametric modeling of growth curve data: an overview (with discussion)

,

Test

,

2001

, vol.

10

(pg.

1

-

73

)

Google Scholar

Crossref

WorldCat

60

Zhao

W

HouW, Littell

RC

et al. ,

Structured antedependence models for functional mapping of multivariate longitudinal quantitative traits

,

Stat Appl Mol Genet Biol

,

2005

, vol.

4

pg.

Article 33

Google Scholar

OpenURL Placeholder Text

WorldCat

61

Fan

J

Yao

Q

,

Nonlinear Time Series: Nonparametric and Parametric Methods

,

2003

New York

Springer

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

62

Cressie

N

Huang

H-C

,

Classes of nonseparable, spatio-temporal stationary covariance functions

,

J Am Stat Assoc

,

1999

, vol.

94

(pg.

1330

-

40

)

Google Scholar

Crossref

WorldCat

63

Gneiting

T

,

Nonseparable, stationary covariance functions for space-time data

,

J Am Stat Assoc

,

2002

, vol.

97

(pg.

590

-

600

)

Google Scholar

Crossref

WorldCat

64

Gneiting

T

Genton

M

Guttorp

P

Finkenstadt

B

Held

L

Isham

V

,

Geostatistical space-time models, stationary, separability and full symmetry

,

Statistical Methods for Spatio-temporal Systems, (Monographs on Statistics and Applied Probability)

,

2006

Boca Raton, Florida

Chapman and Hall/CRC

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

65

Yap

JS

Li

Y

Das

K

et al. ,

Functional mapping of reaction norms to multiple environmental signals through nonparametric covariance estimation

,

BMC Plant Biol

,

2011

, vol.

11

pg.

23

Download all slides

Month:	Total Views:
November 2016	6
December 2016	3
January 2017	2
February 2017	7
March 2017	5
April 2017	6
May 2017	5
June 2017	4
July 2017	2
August 2017	1
October 2017	7
November 2017	2
December 2017	15
January 2018	10
February 2018	5
March 2018	13
April 2018	14
May 2018	8
June 2018	8
July 2018	26
August 2018	11
September 2018	13
October 2018	9
November 2018	16
December 2018	14
January 2019	21
February 2019	16
March 2019	22
April 2019	17
May 2019	19
June 2019	13
July 2019	19
August 2019	11
September 2019	8
October 2019	11
November 2019	17
December 2019	14
January 2020	9
February 2020	11
March 2020	9
April 2020	12
May 2020	8
June 2020	15
July 2020	18
August 2020	15
September 2020	20
October 2020	14
November 2020	14
December 2020	12
January 2021	12
February 2021	12
March 2021	35
April 2021	10
May 2021	9
June 2021	6
July 2021	10
August 2021	7
September 2021	5
October 2021	17
November 2021	16
December 2021	14
January 2022	11
February 2022	9
March 2022	9
April 2022	25
May 2022	18
June 2022	24
July 2022	20
August 2022	15
September 2022	43
October 2022	86
November 2022	26
December 2022	59
January 2023	21
February 2023	9
March 2023	22
April 2023	30
May 2023	10
June 2023	5
July 2023	9
August 2023	18
September 2023	16
October 2023	17
November 2023	23
December 2023	21
January 2024	16
February 2024	4
March 2024	15
April 2024	20
May 2024	12
June 2024	15
July 2024	17
August 2024	14
September 2024	11
October 2024	19
November 2024	16
December 2024	8
January 2025	6
February 2025	7
March 2025	27
April 2025	5
May 2025	1

Article Contents

How to cluster gene expression dynamics in response to environmental signals

Abstract

INTRODUCTION

GENE-CLUSTERING FRAMEWORK

Statistical model

Structural modeling of mean vectors

Parametric modeling

Nonparametric modeling

Semiparametric modeling

Structural modeling of covariance

Autoregressive moving-average process(p,q)

Kernel smoothing

Modeling covariance over time and environment

Estimation and tests

WORKED EXAMPLE

Data analysis

Simulation

DISCUSSION

SUPPLEMENTARY DATA

FUNDING

References

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Article Contents

How to cluster gene expression dynamics in response to environmental signals

Abstract

INTRODUCTION

GENE-CLUSTERING FRAMEWORK

Statistical model

Structural modeling of mean vectors

Parametric modeling

Nonparametric modeling

Semiparametric modeling

Structural modeling of covariance

Autoregressive moving-average process(p,q)

Kernel smoothing

Modeling covariance over time and environment

Estimation and tests

WORKED EXAMPLE

Data analysis

Simulation

DISCUSSION

SUPPLEMENTARY DATA

FUNDING

References

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

This Feature Is Available To Subscribers Only