The Genomes OnLine Database (GOLD) v.4: status of genomic and metagenomic projects and their associated metadata

Table 1.

Project and sequencing status definitions and number of projects

	Project/sequencing Status	Definition	Projects
1.	Complete	Genome project has been completed and the final sequence is deposited in INSDC	2907
	Finished	Completely sequenced and deposited in INSDC	1918
	Permanent Draft	Draft sequenced and deposited in INSDC	989
2.	Incomplete	Genome project is incomplete	7629
	Complete	Completely sequenced but not yet deposited in INSDC	25
	Draft	Draft sequenced and deposited in INSDC	1568
	In progress	Sequencing is in progress but no available sequence yet	3404
	DNA received	DNA has been received by sequencing center	211
	Awaiting DNA	DNA has not yet been received by sequencing center	437
3.	Targeted	Project is targeted, but has not yet been picked by any sequencing center	445

As the cost of genome finishing has not dropped in proportion to the cost of sequencing, an increasing number of sequencing projects are completed at the draft stage. GOLD now monitors these projects and distinguishes between finished and permanent draft projects, although both are listed under complete genome projects. For all of these projects, the genome sequence is ‘completed’ by depositing the final version of the project in one of the public archival databases, such as GenBank (17), European Molecular Biology Laboratory (EMBL) (18) and DNA Data Bank of Japan (DDBJ) (19). As before, each number in the sequencing status distribution table links to the full list of projects in that category, which includes an embedded Google Maps API view of the geographic location of each project, where available.

PHYLOGENETIC DISTRIBUTION

Following the phylogenetic distribution link, the user obtains a breakdown of GOLD data showing, for each of the five main taxonomic levels, the number of classified subdivisions with genome projects over the total number of classified subdivisions in each phylogenetic group (Table 2).

Table 2.

Number of classified subdivisions with genome projects over the number of classified subdivisions of this phylogenetic group and the coverage of genome projects per taxonomic level

Domain	Projects		Phyla		Class		Order		Family		Genus
Domain	2011	2009	2011	2009	2011	2009	2011	2009	2011	2009	2011	2009
Archaea	327	179	5/5	5/5	10/10	10/10	18/18	18/18	28/29	24/26	96/118	85/109
Percentage coverage			100	100	100	100	100	100	97	92	81	78
Bacteria	8458	4184	32/34	27/29	51/53	45/47	109/118	234/281	254/298	234/281	885/2106	730/1930
Percentage coverage			94	93	100	96	92	83	85	83	42	38
Eukarya	2205	1280	33/57	29/55	93/182	80/188	258/1037	350/6288	458/6689	350/6288	729/54 K	536/48 K
Percentage coverage			58	53	51	43	25	6	7	6	1	1

For example, the value 96/118 for archaeal genome projects at the Genus level in 2011 corresponds to 96 archaeal genera with genome projects out of a total of 118 described genera. NCBI taxonomic information is used to determine the known classified subdivisions per taxonomic level. The corresponding information from the 2009 GOLD release is also provided for comparison at each taxonomic level, together with the percentage coverage of the classified subdivisions per taxonomic level. It is interesting to note that genome project coverage at the Genus level has reached 81% for Archaea, while for Bacteria and Eukarya it stands at 42% and 1%, respectively. It is also interesting to observe that, while the number of new species and genera has been steadily increasing in all three domains during the last 2 years, the coverage of each taxonomic subdivision with genome projects is growing at an even faster pace. An interactive graphic display (pie chart) and an interactive table for each of the different classes of organisms are available from this table in GOLD.
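The percentage rows in Table 2 follow directly from the fractions above them. As a minimal sketch (the helper name `coverage_percent` is ours and not part of GOLD), the 2011 Genus-level coverage for Archaea and Bacteria can be recomputed as follows. The Eukarya entry is omitted because its denominator is given only approximately (54 K).

```python
def coverage_percent(fraction: str) -> int:
    """Convert a 'covered/total' entry from Table 2 into a rounded percentage."""
    covered, total = (int(part) for part in fraction.split("/"))
    return round(100 * covered / total)


# 2011 Genus-level entries copied from Table 2.
genus_2011 = {"Archaea": "96/118", "Bacteria": "885/2106"}

for domain, fraction in genus_2011.items():
    print(f"{domain}: {fraction} -> {coverage_percent(fraction)}%")
```

Running the sketch reproduces the 81% and 42% figures quoted in the text.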

The sampling bias of genome sequencing projects in favor of the three major bacterial lineages (Proteobacteria, Firmicutes and Actinobacteria) reported in 2009 (5) has actually increased during the last couple of years, from 80% to 86% today, despite large-scale sequencing efforts such as GEBA (7) and HMP (6), which target previously neglected phylogenetic lineages (Figure 1D). Clearly, there remains much room for improvement in phylogenetic coverage, and further progress can be expected as similar large-scale biodiversity sequencing efforts scale up and as the number of uncultured genome projects increases.

METAGENOMICS DATA

Metagenome studies and metagenome samples

During the past 2 years, a growing number of metagenomic studies have been added to GOLD. The database currently reports 340 studies associated with 1927 samples, a 1.7-fold increase over the 200 distinct metagenomic studies (previously called projects) in 2009 and a fourfold increase over the 453 samples of 2009. To facilitate the visualization of metagenome samples independently of the metagenome studies with which they are associated, we introduced a new GOLD ID for samples, the ‘Gs’. Moreover, we added to the main page a new shortcut button for the interactive listing of all metagenomic samples, a Google Maps API view marking the geographic location of their isolation and a new advanced metagenome search option under the Search GOLD link.

Metagenome naming and classification

During project registration, it is critical to ensure that both the study and its samples follow the standardized naming convention and classification scheme described previously (10). The standardized metagenome naming convention consists of four major components, analogous to the schema used for naming isolate organisms (i.e. genus, species, subspecies and strain). These are:

Habitat: Specification of the study/sample habitat, e.g. sediment, soil, marine, termite gut, wastewater, etc.
Community: Specification of the microbial community sampled, e.g. microbial/bacterial, viral, archaeal or other.
Location: Specification of the study/sample location, e.g. Black Sea, Etolikon lagoon, healthy adults, etc. Geographic longitude and latitude are required for environmental samples, in line with MIMS (minimum information about metagenomic sequence/sample) (20).
Identifier: Specification of the study/sample identifier, which describes anything that can identify the specific type of the community, such as chlorotrophic, anoxygenic, time-series, thermal gradient, etc.

Accordingly, a study that examines viruses from acidified waters of the Black Sea will be named ‘Marine viral communities from the Black Sea, under conditions of ocean acidification’, as opposed to ‘Metagenome from viruses in acidified waters’, while a study examining microbial communities from sludge in bioreactors at the University of California, Davis, will be named ‘Wastewater microbial communities from Enhanced Biological Phosphorus Removal (EBPR) bioreactor at University of California, Davis’, instead of ‘US sludge’. Following these ‘naming rules’, studies and samples are named and then classified with the newly implemented hierarchical classification schema. The list of all metagenome studies and samples in GOLD according to this hierarchy is available online under the ‘metagenome classification’ link in the main page. The classification is expanding as new metagenome studies are registered and in response to users’ requests, and has already been adopted by the IMG with microbiome samples (IMG/M) database (11). Moreover, as of this year, metagenome samples submitted for annotation and integration into IMG/M are no longer processed without prior standardized metagenome naming and classification or without minimum metadata information. We hope that in this way we will improve the quality of the metagenomic data and thereby advance data exploration through more accurate sample identification and selection.
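The way the four components combine into a name can be sketched in a few lines. The template below is inferred from the two example names above and is only an assumption, not GOLD's registration code. The class `MetagenomeName` and its `compose` method are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class MetagenomeName:
    """Hypothetical container for the four naming components."""
    habitat: str          # e.g. "Marine", "Wastewater"
    community: str        # e.g. "viral", "microbial"
    location: str         # e.g. "the Black Sea"
    identifier: str = ""  # optional qualifier of the community type

    def compose(self) -> str:
        # Assumed template: "<Habitat> <community> communities from <location>[, <identifier>]"
        name = f"{self.habitat} {self.community} communities from {self.location}"
        if self.identifier:
            name += f", {self.identifier}"
        return name


black_sea = MetagenomeName(
    habitat="Marine",
    community="viral",
    location="the Black Sea",
    identifier="under conditions of ocean acidification",
)
print(black_sea.compose())
```

With these inputs the sketch reproduces the Black Sea example name given in the text. Leaving the identifier empty yields names of the same form as the EBPR bioreactor example.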

The standardized classification scheme implemented in GOLD represents the first and so far the only proposed classification system for metagenomes. The five levels of classification are available for every study or sample, either through the study or sample lists or through the metagenome search. The five levels comprise (i) Ecosystem (e.g. environmental, host-associated or engineered); (ii) Ecosystem category (e.g. terrestrial, aquatic, air, wastewater, food production, human, Arthropoda, etc.); (iii) Ecosystem type (e.g. freshwater, soil, respiratory system, skin, etc.); (iv) Ecosystem subtype (e.g. grass, oral, groundwater, etc.); and (v) Specific ecosystem (e.g. fecal, cave water, etc.), each of which further subdivides the categories from which metagenomic samples are isolated.
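One way to picture the five-level hierarchy is as an ordered record whose fields run from the broadest to the most specific level. The sketch below is illustrative only: the class name `EcosystemClassification`, the `path` helper and the particular combination of values are our assumptions, with the values themselves taken from the examples listed above.

```python
from typing import NamedTuple, Optional


class EcosystemClassification(NamedTuple):
    """Hypothetical record holding the five classification levels, broadest first."""
    ecosystem: str
    ecosystem_category: str
    ecosystem_type: str
    ecosystem_subtype: Optional[str] = None
    specific_ecosystem: Optional[str] = None

    def path(self) -> str:
        # Join the levels that are filled in, from broadest to most specific.
        return " > ".join(level for level in self if level)


# Illustrative combination built from the example values in the text.
sample = EcosystemClassification(
    "Environmental", "Aquatic", "Freshwater", "Groundwater", "Cave water"
)
print(sample.path())
```

Keeping the levels in a fixed order means that two samples can be compared level by level, which is what a hierarchical listing of studies and samples requires.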

The three top-level ecosystems under which all metagenome studies are classified currently hold 197 environmental studies compared with 137 in 2009, 112 host-associated studies compared with 53 in 2009 and 30 engineered studies compared with 10 in 2009 (Figure 1C). Moreover, there are now 609 samples under environmental studies, 1067 under host-associated and 70 under engineered. The user can reach the interactive list of all metagenome samples by selecting the corresponding shortcut link in the main page (Figure 3). The header of this list provides the breakdown of metagenome samples into Engineered, Environmental and Host-associated projects. The main table of the list provides the GOLD ID ‘Gs’ for each sample, the sample name, the GOLD ID ‘Gm’ for the study to which the sample belongs, and the metagenome classification of the sample (Ecosystem, Ecosystem category, Ecosystem type, Ecosystem subtype and Specific ecosystem). The table also provides the name of the sequencing center that has undertaken the sequencing of the sample, as well as the sample's size and its sequencing status. The data in the list can be sorted in ascending or descending order by clicking any of its headers.

Figure 3.

Metagenome sample list.

Beyond the interface: metadata collection and management system

The metadata associated with metagenome samples have also undergone significant expansion in GOLD during the last 2 years. The number of metadata categories for metagenome sample description has increased from one in the previous release to four in GOLD v.4. These categories now include: (i) sample information, (ii) sequencing information, (iii) environmental metadata and (iv) host metadata. Accordingly, the particular characteristics of metagenome samples have been decoupled from the metagenome studies, which can be quite broad and encompass several classes of metagenome samples, such as environmental and host-associated samples, or environmental samples collected from both marine and terrestrial environments. Similarly, the number of metadata fields assigned to genome projects has grown during the last 2 years. A large number of the metadata fields have been populated for all or most of the projects, while some fields (particularly newer ones) are yet to be curated for the majority of the projects. Special emphasis in the curation was placed on ensuring that most metagenomic samples have isolation information and that most of the genomic projects have either environment or host metadata associated with them.

GOLD DATA OVERVIEW AND CONCLUSIONS

Genome project registration in GOLD has been steadily increasing over time, with an average 2-fold increase every 2 years for the past 14 years (Figure 1A). Microbial genome projects account for the majority of that increase, yet, as is obvious from Table 2, the number of taxonomic groups with no registered projects has only slightly decreased in the past 2 years. In our last report of the database in 2009 (5), we had predicted the number of projects to be sequenced within the years 2010–2015 (11 000 drafts) based on a conservative approach; with a more radical one, we had predicted that this would be achieved by 2012. This milestone was already achieved and surpassed in 2011, at least in the number of projects. However, the total number of draft genomes is almost 3000, a strong 3-fold increase since 2009. We had also foreseen a higher number of finished genomes by the end of the first half of the decade (3000 finished projects). This number has almost doubled since 2009, from 1000 to 1918 finished sequences. Based on the data that GOLD currently holds, we anticipate that the number of genome projects will continue doubling every 2 years. However, the community will most likely be oriented toward generating permanent draft rather than finished genome sequences, owing to the cost efficiency of such strategies.
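The expectation of doubling every 2 years corresponds to a simple exponential model, N(t) = N0 × 2^((t − t0)/2). The sketch below only illustrates that arithmetic. It assumes a round baseline of 11 000 projects in 2011 (an approximation based on the milestone discussed above), and the function name `projected_projects` is ours; the resulting numbers are consequences of the doubling assumption, not GOLD forecasts.

```python
def projected_projects(year: int,
                       base_year: int = 2011,
                       base_count: float = 11_000,
                       doubling_years: float = 2.0) -> float:
    """Project the number of registered projects under a constant doubling time."""
    # base_count is an assumed round figure, not an exact GOLD total.
    return base_count * 2 ** ((year - base_year) / doubling_years)


for year in (2011, 2013, 2015):
    print(year, round(projected_projects(year)))
```

Changing `doubling_years` shows how sensitive such projections are to the assumed growth rate.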

In the last 2 years, we also witnessed a 4-fold increase in the number of registered metagenome samples. We expect this trend to continue in the next 2 years, possibly reaching a 20-fold increase in metagenomic samples and a 10-fold increase in uncultured or single-cell organisms. Large-scale sequencing initiatives now being launched, such as the Earth Microbiome Project (EMP) (21), which targets the sequencing of thousands of samples, promise to sustain this growth.

FUTURE DEVELOPMENTS AND CHALLENGES

GOLD continues to evolve into a universal catalog tracking genomic and metagenomic projects and their associated metadata, while increasing in scope and complexity. It is therefore charged not only with providing data interconnectivity, exchange and dissemination but also with establishing genomic standards, enforcing them and making sure that the community abides by them. We anticipate that the GOLD metagenome naming and classification will become extremely important for most scientists in the field in the coming years.

By its nature, GOLD sits in the middle of emerging trends in science and technology, such as ‘big data’ engineering, where the size of the data itself becomes part of the problem, and ‘data science’, which tackles the problem of integrating data from all sorts of resources and encompasses anything from statistics and machine learning to computer science and art. Big data demands all-inclusive data platforms (not just internal repositories) that enable synchronization with other bioinformatics databases. These have to go beyond the relational database model, toward NoSQL or other nonrelational databases with flexible schemas that provide eventual rather than absolute consistency, together with frameworks that enable agile data analysis (e.g. Hadoop). Moreover, statistics and visualization are key to data conditioning and analysis for such large data sets, and packages such as gnuplot are becoming crucial for gaining insight into data trends and future trajectories. In the coming years, GOLD will have to implement such database schemas and visualization capabilities, not only to advance scientific research in the field but also to lead the way among bioinformatics databases.

FUNDING

Director, Office of Science, Office of Biological and Environmental Research, Life Sciences Division, US Department of Energy (DE-AC02-05CH11231); National Energy Research Scientific Computing Center, Office of Science of the US Department of Energy (DE-AC02-05CH11231); Genome and metagenomes projects and metadata associated with human host associated environment, US National Institutes of Health Data Analysis and Coordination Center (U01-HG004866). Funding for open access charge: University of California.

Conflict of interest statement. None declared.

ACKNOWLEDGEMENTS

We thank the members of the microbial genomics and metagenomics programs at the JGI for support, useful discussions and exchange of ideas. We would also like to thank Michelle Gwinn Giglio and Heather Huot from University of Maryland for valuable feedback and interactions for the Human microbiome related projects, as well as Lynn Schriml for feedback and help for the EnvO ontology.

REFERENCES

1. Kyrpides. Genomes OnLine Database (GOLD 1.0): a monitor of complete and ongoing genome projects world-wide, Bioinformatics, 1999 (pg. 773-774)
2. Bernal, Ear, Kyrpides. Genomes Online Database (GOLD): a monitor of genome projects world-wide, Nucleic Acids Res., 2001 (pg. 126-127)
3. Liolios, Tavernarakis, Hugenholtz, Kyrpides. The Genomes On Line Database (GOLD) v.2: a monitor of Genome Projects world-wide, Nucleic Acids Res., 2006 (pg. D332-D334)
4. Liolios, Mavromatis, Tavernarakis, Kyrpides. The Genomes OnLine Database (GOLD) in 2007: status of genomic and metagenomic projects and their associated metadata, Nucleic Acids Res., 2007 (pg. D475-D479)
5. Liolios, Chen, Mavromatis, Tavernarakis, Hugenholtz, Markowitz, Kyrpides. The Genomes OnLine Database (GOLD) in 2009: status of genomic and metagenomic projects and their associated metadata, Nucleic Acids Res., 2010 (pg. D346-D354)
6. Nelson, Weinstock, Highlander, Worley, Creasy, Wortman, Rusch, Mitreva, Sodergren, Chinwalla, et al. A catalog of reference genomes from the human microbiome, Science, 2010, vol. 328 (pg. 994-999)
7. Hugenholtz, Mavromatis, Pukall, Dalin, Ivanova, Kunin, Goodwin, Tindall, et al. A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea, Nature, 2009, vol. 462 (pg. 1056-1060)
8. Field, Amaral-Zettler, Cochrane, Cole, Dawyndt, Garrity, Gilbert, Glöckner, Hirschman, Karsch-Mizrachi, et al. The Genomic Standards Consortium, PLoS Biol., 2011 (pg. e1001088)
9. Yilmaz, Kottmann, Field, Knight, Cole, Amaral-Zettler, Gilbert, Karsch-Mizrachi, Johnston, Cochrane, et al. Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications, Nat. Biotechnol., 2011 (pg. 415-420)
10. Ivanova, Tringe, Liolios, Liu, Morrison, Hugenholtz, Kyrpides. A call for standardized classification of metagenome projects, Environ. Microbiol., 2010 (pg. 1803-1805)
11. Markowitz, Ivanova, Szeto, Palaniappan, Chu, Dalevi, Chen I-MA, Grechkin, Dubchak, Anderson, et al. IMG/M: a data management and analysis system for metagenomes, Nucleic Acids Res., 2008 (pg. D534-D538)
12. Markowitz, Chen I-MA, Palaniappan, Chu, Szeto, Grechkin, Ratner, Anderson, Lykidis, Mavromatis, et al. The Integrated Microbial Genomes (IMG) system: an expanding comparative analysis resource, Nucleic Acids Res., 2010 (pg. D382-D390)
13. Markowitz, Mavromatis, Ivanova, Chen I-MA, Chu, Kyrpides. Expert IMG ER: a system for microbial genome annotation expert review and curation, Bioinformatics, 2009 (pg. 2271-2278)
14. Birney, Hudson, Green, Gunter, Eddy, Rogers, Harris, Ehrlich, Apweiler, Austin, et al. Prepublication data sharing, Nature, 2009, vol. 461 (pg. 168-170)
15. Kyrpides. Fifteen years of microbial genomics: meeting the challenges and fulfilling the dream, Nat. Biotechnol., 2009 (pg. 627-632)
16. Garrity, Field, Kyrpides. Standards in genomic sciences, Stand. Genomic Sci., 2009
17. Benson, Karsch-Mizrachi, Lipman, Ostell, Sayers. GenBank, Nucleic Acids Res., 2011 (pg. D32-D37)
18. Leinonen, Akhtar, Birney, Bower, Cerdeno-Tarraga, Cheng, Cleland, Faruque, Goodgame, Gibson, et al. The European nucleotide archive, Nucleic Acids Res., 2011 (pg. D28-D31)
19. Kaminuma, Kosuge, Kodama, Aono, Mashima, Gojobori, Sugawara, Ogasawara, Takagi, Okubo, et al. DDBJ progress report, Nucleic Acids Res., 2011 (pg. D22)
20. Glass, Meyer, Gilbert, Field, Hunter, Kottmann, Kyrpides, Sansone, Schriml, Sterk, et al. Meeting report from the genomic standards consortium (GSC) workshop 10, Stand. Genomic Sci., 2010 (pg. 225)
21. Gilbert, Meyer, Jansson, Gordon, Pace, Tiedje, Ley, Fierer, Field, Kyrpides, et al. The Earth Microbiome Project: Meeting report of the ‘1st EMP meeting on sample selection and acquisition’ at Argonne National Laboratory, Stand. Genomic Sci., 2011 (pg. 249-253)