SiGN-SSM: open source parallel software for estimating gene networks with state space models

Abstract

Summary: SiGN-SSM is open-source gene network estimation software that runs in parallel on PCs and massively parallel supercomputers. The software estimates a state space model (SSM), a statistical dynamic model suited to analyzing short and/or replicated time series of gene expression profiles. SiGN-SSM implements a novel parameter constraint that effectively stabilizes the estimated models. In addition, on a supercomputer it can determine the gene network structure by a statistical permutation test in a practical amount of time. SiGN-SSM is applicable not only to analyzing temporal regulatory dependencies between genes, but also to extracting differentially regulated genes from time series expression profiles.

Availability: SiGN-SSM is distributed under the GNU Affero General Public Licence (GNU AGPL) version 3 and can be downloaded at http://sign.hgc.jp/signssm/. Pre-compiled binaries for some architectures are available in addition to the source code. Pre-installed binaries are also available on the Human Genome Center supercomputer system. The online manual and the supplementary information for SiGN-SSM are available on our web site.

Contact: [email protected]

1 INTRODUCTION

Analyzing the dynamical regulatory mechanisms of gene expression in a cellular system is a challenging problem in systems biology. To this end, many computational methods have been proposed to estimate dynamical systems of regulatory dependencies between genes from temporal gene expression profiles. The major difficulty in these studies is that the number of time points is small relative to the number of variables (genes) in a computational model. A state space model (SSM) (Hirose et al., 2008; Kitagawa and Gersch, 1996; West and Harrison, 1997) is a statistical model that is applicable to temporal datasets with few time points because it reduces the number of parameters to be estimated. The SSM decomposes the temporal gene expression into a dynamical system of modules, called the system model (or state space), and a mapping from the modules to the individual genes, called the observation model. A number of gene network studies have used the SSM (Beal et al., 2005; Hirose et al., 2008; Rangel et al., 2004).

SiGN-SSM is a re-implemented, new version of the previously released software TRANS-MNET (Hirose et al., 2008). Compared with TRANS-MNET, SiGN-SSM offers the following improvements: (i) it implements a novel constraint on the model parameters that effectively stabilizes the estimated models for short time series data with irregular time intervals; (ii) it runs in parallel as a multithreaded program exploiting multicore CPUs, as a bulk (array) job through a job dispatching system such as Sun (Oracle) Grid Engine (SGE) on PC cluster systems, and as a multi-process MPI (Message Passing Interface) application on massively parallel supercomputers; (iii) it is open-source software, so that everyone can freely access the source code and improve, modify and distribute it; and (iv) it implements, on the Human Genome Center (HGC) supercomputer system, the statistical permutation test for determining a gene network structure proposed in Hirose et al. (2008) and the differentially regulated gene extraction presented in Yamaguchi et al. (2008).

2 STATE SPACE MODEL

We introduce the SSM briefly. For detailed definitions, see Hirose et al. (2008). Let y_n be a vector of p variables representing gene expression profiles for p genes observed at time n ≤ N, where N is the total number of time points. In the SSM, the observable expression profiles y_n are assumed to be generated from the k-dimensional hidden state variables (vector) x_n. Here, we assume that k ≪ p. The SSM is defined by the following two formulae:

    x_n = F x_{n−1} + v_n,   v_n ∼ N(0, Q),
    y_n = H x_n + w_n,   w_n ∼ N(0, R),

where F is the (k, k) state transition matrix and H is the (p, k) observation matrix that maps the state variables x_n to the observation variables y_n. The vectors v_n and w_n are the system noise and the observation noise, respectively, and Q and R are the covariance matrices of their normal distributions. The initial state vector x₀ must also be estimated from the data, and we assume x₀ ∼ N(μ₀, Σ₀). The SSM estimation problem is to estimate the unknown parameters {H, F, Q, R, μ₀} from the observed temporal gene expression data. The dimension of the state vector, k, is also unknown and must be determined. The parameters are estimated from the observed data by the expectation-maximization (EM) algorithm, and k is chosen by comparing the Bayesian Information Criterion (BIC) values of models estimated with different k. Figure 1 shows a conceptual view of the SSM.
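To make the roles of the system model and the observation model concrete, the following minimal Python sketch draws a short synthetic time series from an SSM of this form. It is an illustration only, not part of SiGN-SSM: the function names, the toy matrices and the simplifying assumptions (diagonal Q and R given as standard deviations, a fixed initial state instead of a draw from N(μ₀, Σ₀)) are ours.

```python
import random

def mat_vec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def simulate_ssm(F, H, q_sd, r_sd, x0, N, seed=0):
    """Draw x_1..x_N and y_1..y_N from a linear Gaussian SSM.

    Assumption for brevity: Q and R are diagonal and are passed as
    per-component standard deviations q_sd (length k) and r_sd (length p).
    """
    rng = random.Random(seed)
    x, states, observations = list(x0), [], []
    for _ in range(N):
        # System model: propagate the hidden state and add system noise v_n.
        x = [m + rng.gauss(0.0, s) for m, s in zip(mat_vec(F, x), q_sd)]
        # Observation model: map the k modules to the p genes, add noise w_n.
        y = [m + rng.gauss(0.0, s) for m, s in zip(mat_vec(H, x), r_sd)]
        states.append(x)
        observations.append(y)
    return states, observations

# Toy sizes (hypothetical): k = 2 hidden modules, p = 4 genes, N = 8 time points.
F = [[0.8, 0.1],
     [-0.2, 0.8]]
H = [[1.0, 0.0],
     [0.5, 0.5],
     [0.0, 1.0],
     [-1.0, 0.3]]
states, observations = simulate_ssm(F, H, q_sd=[0.1, 0.1], r_sd=[0.05] * 4,
                                    x0=[1.0, -1.0], N=8)
```

Each of the eight observation vectors has p = 4 entries although only k = 2 state variables evolve over time, which is how the SSM keeps the number of dynamic parameters small when k ≪ p.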

Fig. 1.

Conceptual view of the state space model.

3 IMPLEMENTATION AND PARALLELIZATION

SiGN-SSM is written in C and uses the BLAS/LAPACK libraries. Since the EM algorithm finds only locally optimal parameters, it must be run many times with different initial values to obtain a better estimate for a single k. To speed up the parameter estimation and the determination of the optimal k, SiGN-SSM supports multiprocess parallelization using MPI, multithread parallelization using OpenMP and parallelization by bulk jobs on PC clusters. Using MPI, we confirmed that SiGN-SSM can optimize multiple k in parallel very efficiently with up to 256 CPU cores; see the Supplementary information on our web site for detailed results. The permutation test determines the gene network structure from the estimated model parameters, but it requires much more computational time than the parameter estimation. To solve this problem, we parallelized it on the HGC supercomputer using SGE.
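The restart strategy is easy to parallelize because the runs are independent of each other. The Python sketch below illustrates the pattern only: a gradient ascent on a toy multimodal function stands in for one EM run, and a thread pool stands in for the MPI, OpenMP or bulk-job back ends. All names and the toy objective are hypothetical and are not taken from the SiGN-SSM source code.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor

def toy_log_likelihood(theta):
    """Multimodal stand-in for an SSM log-likelihood (hypothetical)."""
    return math.sin(3.0 * theta) - 0.1 * (theta - 2.0) ** 2

def local_ascent(seed, steps=500, lr=0.01):
    """Stand-in for one EM run: climb from a random start to a local maximum."""
    rng = random.Random(seed)
    theta = rng.uniform(-4.0, 8.0)  # random initial parameter value
    for _ in range(steps):
        grad = 3.0 * math.cos(3.0 * theta) - 0.2 * (theta - 2.0)
        theta += lr * grad
    return toy_log_likelihood(theta), theta, seed

def best_of_restarts(n_restarts=32, workers=4):
    """Run independent restarts concurrently and keep the best result."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(local_ascent, range(n_restarts)))
    # Tuples compare by their first field, the log-likelihood.
    return max(results), results

best, results = best_of_restarts()
```

Because no restart needs information from any other, the same pattern maps directly onto separate processes or cluster jobs, with a single reduction step at the end that keeps the run with the highest likelihood. In CPython, threads do not speed up pure-Python arithmetic; they are used here only to keep the sketch short.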

4 NEW CONSTRAINT ON PARAMETERS

When the algorithm estimates the model parameters from short time series data measured at irregular time intervals, the resulting estimates often oscillate undesirably (Fig. 2). To suppress such spurious patterns, we propose a novel constraint on the system transition matrix F along the lines of the smoothness prior approach (Kitagawa and Gersch, 1996). With the constraint, we assume that the value of the state vector at time n is similar to that at time n − 1. The constrained version of F, denoted by F̃, has its diagonal elements fixed to F̃_ii = g for i = 1,…, k, where 0 ≤ g ≤ 1 is a constant that controls the smoothness and can be chosen by the user (the default value is 0.8). The off-diagonal components of F̃ are estimated by the EM algorithm subject to the constraint on the diagonal components, using the general framework of Wu et al. (1996) for constraining the parameters of an SSM in the EM algorithm. The value of g can be determined, for example, by comparing BIC values or by cross-validation.
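The effect of a fixed, non-negative diagonal can be seen on a noise-free toy trajectory. The Python sketch below simply overwrites the diagonal of a hypothetical transition matrix with g and compares the resulting state sequences. It does not perform the constrained EM update used by SiGN-SSM, and the matrix values and function names are assumptions chosen for illustration.

```python
def constrain_diagonal(F, g=0.8):
    """Return a copy of F whose diagonal entries are all set to g."""
    if not 0.0 <= g <= 1.0:
        raise ValueError("g must lie in [0, 1]")
    k = len(F)
    return [[g if i == j else F[i][j] for j in range(k)] for i in range(k)]

def trajectory(F, x0, steps):
    """Noise-free state sequence x_n = F x_{n-1}, starting from x0."""
    xs = [list(x0)]
    for _ in range(steps):
        prev = xs[-1]
        xs.append([sum(f * x for f, x in zip(row, prev)) for row in F])
    return xs

def sign_changes(values):
    """Count sign flips between consecutive values."""
    return sum(1 for a, b in zip(values, values[1:]) if a * b < 0)

# Toy transition matrix (hypothetical) whose negative diagonal makes the
# state flip sign at every step.
F_free = [[-0.9, 0.1],
          [0.2, -0.7]]
F_constrained = constrain_diagonal(F_free, g=0.8)

free_path = [x[0] for x in trajectory(F_free, [1.0, 1.0], 6)]
smooth_path = [x[0] for x in trajectory(F_constrained, [1.0, 1.0], 6)]
print(sign_changes(free_path), sign_changes(smooth_path))  # prints: 6 0
```

With the diagonal set to g = 0.8, each state component is carried over largely unchanged from time n − 1 to time n, so the first component no longer alternates in sign, whereas the unconstrained toy matrix flips it at every one of the six steps.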

Fig. 2.

Comparison of the estimates without (left) and with (right) the proposed constraint on F. The plotted lines are the smoothed estimates of the observation variables for very short, triplicate sample data.

5 CONCLUSION

SiGN-SSM is a highly scalable, open-source implementation of an SSM estimation algorithm. The newly proposed constraint on the parameters can significantly stabilize parameter estimation for very short time series data with irregular time intervals. Users can analyze their time series expression datasets using SiGN-SSM, and the estimated model can be applied to other data to extract differentially expressed genes.

ACKNOWLEDGEMENT

Computational resources required for the development of SiGN-SSM were provided by the HGC Supercomputer System, Human Genome Center, Institute of Medical Science, The University of Tokyo, and by the RIKEN supercomputer system RICC.

Funding: ISLiM (Next-generation integrated simulation of living matter) project in RIKEN Computational Science Research Program.

Conflict of Interest: none declared.

REFERENCES

Beal M.J. et al. A Bayesian approach to reconstructing genetic regulatory networks with hidden factors, Bioinformatics, 2005 (pg. 349-356)

Hirose et al. Statistical inference of transcriptional module-based gene networks from time course gene expression profiles by using state space models, Bioinformatics, 2008 (pg. 932-942)

Kitagawa, Gersch. Smoothness Priors Analysis of Time Series, 1996, New York, Springer

Rangel et al. Modelling T-cell activation using gene expression profiling and state space models, Bioinformatics, 2004 (pg. 1361-1372)

West, Harrison. Bayesian Forecasting and Dynamic Models, 1997, New York, Springer

Wu L.S.-Y. et al. An algorithm for estimating parameters of state-space models, Stat. Probab. Lett., 1996 (pg. 106)

Yamaguchi et al. Predicting differences in gene regulatory systems by state space models, Genome Inform., 2008 (pg. 101-113)

Author notes

Associate Editor: Martin Bishop
