ABSTRACT

The Genomes On Line Database (GOLD) is a comprehensive resource for centralized monitoring of genome and metagenome projects worldwide. Both complete and ongoing projects, along with their associated metadata, can be accessed in GOLD through precomputed tables and a search page. As of September 2009, GOLD contains information for more than 5800 sequencing projects, of which 1100 have been completed and their sequence data deposited in a public repository. GOLD continues to expand, moving toward the goal of providing the most comprehensive repository of metadata information related to the projects and their organisms/environments in accordance with the Minimum Information about a (Meta)Genome Sequence (MIGS/MIMS) specification. GOLD is available at: http://www.genomesonline.org and has a mirror site at the Institute of Molecular Biology and Biotechnology, Crete, Greece, at: http://gold.imbb.forth.gr/

HISTORY AND GROWTH

The Genomes OnLine Database (GOLD) provides a centralized resource for the continuous monitoring of genome and metagenome sequencing projects worldwide, uniquely integrated with their associated metadata. Since its founding in 1997 (14), GOLD has grown dramatically, now hosting information regarding over 5800 sequencing projects (Figure 1A).

Figure 1. Statistical information available from GOLD. (A) Evolution of the complete and ongoing genome projects monitored in GOLD from December 1997 through September 2009. (B) Distribution of the 5831 genome projects across the major sequencing centers. Abbreviations: JGI, Joint Genome Institute; JCVI, J. Craig Venter Institute; Broad, Broad Institute; WashU, Washington University; Sanger, the Wellcome Trust Sanger Institute; BCM-HGSC, Baylor College of Medicine Human Genome Sequencing Center; WORLD, all other sequencing centers. (C) Distribution of the 200 current metagenome projects across the three major metagenome classification categories. (D) Phylogenetic distribution of the 4172 bacterial genome projects as of September 2009.

The number of registered sequencing projects has doubled since the publication of the previous report two years ago (4). As of September 2009, 5843 projects have been recorded, versus 2905 as of September 2007 and 1575 as of September 2005 (3, 4). This rapid growth has been fueled by decreasing sequencing costs combined with technological advances, and was significantly augmented by the launching and successful execution of several large-scale microbial genome sequencing initiatives, e.g. the Human Microbiome Project (http://www.hmpdacc.org/) and the Genomic Encyclopedia of Bacteria and Archaea (http://www.jgi.doe.gov/programs/GEBA/). During this period, GOLD has also expanded its scope beyond standard genomic and metagenomic projects to now encompass data from the growing number of resequencing, transcriptome and metatranscriptome projects.
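The growth described above can be checked directly from the three project counts quoted in the text. The following is a minimal Python sketch of that arithmetic; the variable and function names are illustrative only and are not part of GOLD.

```python
# Project counts quoted in the text, keyed by the year of the September snapshot.
project_counts = {2005: 1575, 2007: 2905, 2009: 5843}


def fold_changes(counts):
    """Return {(earlier_year, later_year): ratio} for consecutive snapshots."""
    years = sorted(counts)
    return {
        (earlier, later): counts[later] / counts[earlier]
        for earlier, later in zip(years, years[1:])
    }


for (earlier, later), ratio in fold_changes(project_counts).items():
    print(f"{earlier} -> {later}: {ratio:.2f}-fold")
```

With these figures the increase is about 1.84-fold from 2005 to 2007 and about 2.01-fold from 2007 to 2009.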

In parallel with this doubling in the number of genome projects has come an increase in the number of captured metadata fields from 56 in 2007 (4) to 135 today. This is an area of active development; thus, we anticipate further increases as more metadata types are described and captured in published studies. Some of the new metadata types are described below.

Among the most important developments of the database during the last 2 years are those coupled to the growth of the metadata. These include the implementation of GOLD-specific Controlled Vocabularies (CVs) for the representation of the associated data, as well as coordination with the Genomic Standards Consortium (GSC) (http://gensc.org/) and compliance with its recommendations for the Minimum Information about a (Meta)Genome Sequence (MIGS/MIMS) (5).

As the rate of launching new projects accelerates, the task of monitoring and recording their data along with their metadata grows ever more difficult. Therefore, the sequencing centers and the community at large are strongly encouraged to register their own sequencing projects in GOLD to ensure complete and accurate project tracking.

CURRENT STATUS OF GOLD

Published complete genomes

The year 2009 represents a landmark in the history of genome sequencing projects: the completed sequencing of the first 1000 genomes. As of September 2009, GOLD documents 1100 completed genome projects, a 1.7-fold increase from 2 years ago (4). These comprise 914 bacterial, 68 archaeal and 118 eukaryotic genomes. Thus, the completely sequenced archaeal and bacterial genomes currently total 982, leading one to confidently predict that the community will celebrate yet another 1000 genome milestone before the end of the year.

For all of these projects, the complete genome sequence is ‘published’ by being deposited in one of the public archival databases such as GenBank (6), EMBL (7) and DDBJ (8). However, a rapidly increasing proportion of the projects do not have an associated publication in the literature. That fraction currently stands at 37% (408 of 1100). This shift is partly attributable to the more frequent release of sequence data to the community prior to publication, in compliance with the rapid pre-publication data release policies and recommendations (9). Another factor is the increase in larger-scale efforts that involve the parallel sequencing of several hundred organisms (e.g. the HMP and GEBA). Here, preparation of the typical detailed publication describing the genome of every single organism would be virtually impossible (4,10). This situation calls for a new mechanism that can provide a GSC-compliant citable record for every completed genome project and its metadata. To that end, an open access scientific journal, Standards in Genomic Sciences (SIGS; http://standardsingenomics.org/), has recently been launched (11) with the goal of cataloging and maintaining the data from completed genome projects in an orderly and standardized manner (10).

In addition to publication of each complete genome sequence, GSC also strongly recommends that the source organism be available from a culture collection center. It is unfortunate that after so many years and so many genome sequences, the widely accepted policies for publication of genome sequencing projects require the submission to a public repository of only the sequence data, not also the biological material itself. As a result, from the current list of 982 completed archaeal and bacterial genomes, only 518 (53%) appear to be available from a culture collection center (12), and only half of those genomes (27% of the total) represent a type strain of the sequenced species.

Ongoing genome projects

In addition to the 1100 completed projects, there are currently 4543 ongoing sequencing projects, of which 3271 are bacterial, 110 archaeal and 1162 eukaryotic. This total is more than double the 2158 reported 2 years ago. Until recently, the projects monitored in GOLD were predominantly ‘Genome’ and ‘EST’ sequencing projects, supplemented by a small number of ‘Genome-Surveys’ and ‘Genome-Regions’ (the latter representing some eukaryotic projects focused on specific genomic regions). The increasing number of ‘Resequencing’ and ‘Transcriptome’ projects prompted the addition of these two new project types during the past year (Table 1).

Table 1. Project type distribution

Domain     Total   Genome   Transcriptome(*)   Resequencing   Uncultured
Archaea      179      169                  0              0            9
Bacteria    4184     4097                  4             35           14
Eukarya     1280      804                344             45            1

(*) For Eukarya, this column reports EST/Transcriptome projects.

The current Sequencing Status distribution tallied by domain is shown in Table 2. The Sequencing Status designations and current tallies are as follows:

  • Complete: DNA sequencing has been completed; 288 projects in addition to the 1100 already published.

  • Draft: a draft sequence has been deposited in a public repository; 1164 projects.

  • In progress: the DNA has been received by the sequencing center, but no data have yet been publicly released; 442 projects.

  • Awaiting DNA: an organism selection has been made, but the DNA has not yet arrived at the DNA sequencing center; 236 projects.

  • Targeted: a project has been identified but further work has not yet begun; 527 projects.
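The five designations above can be modeled as an ordered enumeration. The sketch below is illustrative only: the class and function names are hypothetical, and the ordering from least to most advanced is an assumption based on the definitions in the list, not something GOLD prescribes. The tallies are those quoted in the list.

```python
from enum import IntEnum


class SequencingStatus(IntEnum):
    """Sequencing Status designations; the numeric order is an assumption."""
    TARGETED = 1      # project identified, no further work yet
    AWAITING_DNA = 2  # organism selected, DNA not yet at the sequencing center
    IN_PROGRESS = 3   # work under way, no public data release yet
    DRAFT = 4         # draft sequence deposited in a public repository
    COMPLETE = 5      # DNA sequencing completed


# Tallies quoted in the list above; COMPLETE excludes the 1100 published genomes.
STATUS_TALLIES = {
    SequencingStatus.COMPLETE: 288,
    SequencingStatus.DRAFT: 1164,
    SequencingStatus.IN_PROGRESS: 442,
    SequencingStatus.AWAITING_DNA: 236,
    SequencingStatus.TARGETED: 527,
}


def projects_at_or_beyond(status, tallies=STATUS_TALLIES):
    """Sum the tallies of every status at least as advanced as `status`."""
    return sum(count for s, count in tallies.items() if s >= status)


print(projects_at_or_beyond(SequencingStatus.DRAFT))
```

An ordered type makes questions such as "how many projects have reached at least the Draft stage" a one-line query.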

Table 2. Project status distribution

Domain     Total   Complete   Draft   In progress   Awaiting DNA   Targeted
Archaea      179         74      16            17              7          1
Bacteria    4184       1151     950           414            142        517
Eukarya     1280        159     178            46             73          9

The distributions of all projects by Project Type and by Sequencing Status are now dynamically tracked with every GOLD update and can be viewed online through the main page at: http://www.genomesonline.org/gold.cgi.

Metagenome projects

The past 2 years have seen a growing number of metagenomic projects added to GOLD, and the expectation is that this trend will continue, reinforced by further advances in the sequencing technology. The database currently reports 200 distinct metagenomic projects, embracing 453 samples.

During curation, careful attention is paid to ensure that project names follow the standardized schema previously described (4). All the metagenome projects are classified under three major categories: environmental (137 projects), endobiotic or host-associated (53 projects) and synthetic (10 projects) (Figure 1C). A hierarchical project classification schema covering all the metagenome projects captured in GOLD is also under development and will soon be available from the database. A prototype of this classification has already been adopted by the Integrated Microbial Genomes with Microbiome Samples (IMG/M) database (13) and is available for browsing online (http://img.jgi.doe.gov/cgi-bin/m/main.cgi?section=TaxonList&page=taxonListPhylo&domain=*Microbiome&genome_type=metagenome).

Metadata

The genome/metagenome associated metadata have also undergone significant expansion in GOLD during the last 2 years. The number of metadata categories has increased from two in the previous release to six in GOLD v.3: (i) organism information; (ii) project information; (iii) sequencing information; (iv) environmental metadata; (v) host metadata; and (vi) organism metadata. Likewise, the number of metadata fields assigned to those categories has grown from 56 to 135.

The current status of the different fields and the number of projects with associated data for each of the corresponding fields are shown in Table 3. Some of the metadata fields are populated for all or most of the projects, while other fields (particularly newer ones) are yet to be curated for the majority of the projects. Although the number of metadata fields is expected to continue to grow, the current list has already been put to use in microbial comparative analysis systems such as the Integrated Microbial Genomes (IMG) (14) and IMG/M (13).

Particularly important developments currently underway involve the integration and mapping of several of the available metadata fields in GOLD to well-developed, publicly available metadata ontologies and controlled vocabularies such as ‘Habitat-Lite’ (15).

Table 3. Metadata categories and fields

No. Field                           Type No. of projects

1. Organism information
 1. GOLD display name               FT   5843
 2. NCBI project Name               FT   3408
 3. Common name                     FT   364
 4. Domain                          CV   5843
 5. Phylum                          CV   5665
 6. Class                           CV   5379
 7. Order                           CV   5608
 8. Family                          CV   5396
 9. Genus                           CV   5570
10. Species                         CV   3856
11. Strain                          FT   4748
12. Serovar                         FT   384
13. NCBI taxon ID                   ID   5699
14. Culture collection ID           FT   1711
15. Type strain                     CV   1970
16. Biosafety level                 ID   260
17. Organism comments               FT   11

2. Project information
 1. GOLD project ID                 ID   5843
 2. GCAT ID                         ID   5843
 3. NCBI project ID                 ID   3600
 4. IMG ID                          ID   1664
 5. Cross reference ID              ID   204
 6. Greengenes ID                   ID   1994
 7. 16S ID                          ID   17
 8. NCBI archive ID                 ID   15
 9. Short read archive ID           ID   117
10. Project type                    CV   5843
11. Project status                  CV   5843
12. Availability                    CV   5843
13. Contact name                    FT   4210
14. Contact email                   FT   3480
15. Contact link                    URL  1034
16. Funding program                 CV   1612
17. Proteomics data                 FT   2
18. Proteomics Link                 URL  2
19. Transcriptomics Data            FT   14
20. Transcriptomics Link            URL  5
21. Locus Tag                       FT   1286
22. GC percent                      FT   2380
23. Chromosome count                ID   1259
24. Plasmid count                   ID   1223
25. Completion date                 ID   1155
26. Publication                     CV   1154
27. Project description             FT   205
28. Project relevance               CV   10 396
29. Funding center                  CV   4450
30. Sequence data                   ID   2794
31. Database                        CV   5101

3. Sequencing information
 1. Sequencing Status               CV   3870
 2. Sequencing quality              CV   321
 3. Seq status link                 URL  800
 4. Library method                  FT   134
 5. Number of reads                 FT   173
 6. Vector                          FT   65
 7. Assembly method                 FT   368
 8. Sequencing depth                FT   1277
 9. Gene calling method             FT   263
10. Contig count                    FT   583
11. Estimated size                  FT   2780
12. Gene count                      FT   1993
13. Sequencing country              CV   5802

4. Environmental metadata
 1. Isolation site                  FT   3188
 2. Source of isolate               FT   609
 3. Method of isolation             FT   141
 4. Isolation comments              FT   134
 5. Collection date                 FT   426
 6. Isolation country               CV   1345
 7. Isolation Pubmed ID             ID   104
 8. Geographic location             FT   2138
 9. Latitude                        FT   769
10. Longitude                       FT   768
11. Altitude                        FT   16
12. Depth                           FT   193

5. Host metadata
 1. Host name                       FT   2029
 2. Host gender                     FT   219
 3. Host race                       FT   3
 4. Host age                        FT   143
 5. Host health                     FT   363
 6. Host medication                 FT   2
 7. Primary body sample site        CV   1643
 8. Body sample subsite             CV   533
 9. Body product                    CV   412
10. Additional body sample site     CV   18

6. Organism metadata
 1. Oxygen requirement              CV   3797
 2. Cell shape                      CV   3710
 3. Motility                        CV   3435
 4. Sporulation                     CV   2610
 5. Temperature Range               CV   4422
 6. Temperature optimum             ID   1319
 7. Salinity                        CV   131
 8. pH                              ID   180
 9. Cell diameter                   FT   68
10. Cell length                     FT   56
11. Color                           CV   44
12. Gram staining                   CV   4229
13. Biotic relationships            CV   4244
14. Symbiotic physical interaction  CV   135
15. Symbiotic relationship          CV   182
16. Symbiont name                   FT   156
17. Cell arrangement                CV   1897
18. Diseases                        CV   5303
19. Habitat                         CV   7214
20. Metabolism                      CV   21
21. Phenotypes                      CV   3345
22. Energy source                   CV   1439

Abbreviations for field types: ID, identity number; FT, free text; CV, controlled vocabulary; URL, uniform resource locator.
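To illustrate how unevenly the fields are populated, the sketch below computes the fraction of projects carrying a value for a handful of fields taken from Table 3. The record type and its names are hypothetical and do not reflect the GOLD schema. The calculation assumes at most one value per project, which cannot hold for every field, since some counts in Table 3 (for example Habitat) exceed the 5843 registered projects; only fields with counts at or below that total are used here.

```python
from dataclasses import dataclass

TOTAL_PROJECTS = 5843  # number of projects with a GOLD project ID in Table 3


@dataclass(frozen=True)
class MetadataField:
    category: str
    name: str
    field_type: str  # one of "ID", "FT", "CV", "URL"
    n_projects: int

    def coverage(self, total=TOTAL_PROJECTS):
        """Fraction of projects that carry a value for this field."""
        return self.n_projects / total


# A small sample of rows copied from Table 3.
fields = [
    MetadataField("Organism information", "Domain", "CV", 5843),
    MetadataField("Organism information", "Type strain", "CV", 1970),
    MetadataField("Sequencing information", "Sequencing depth", "FT", 1277),
    MetadataField("Environmental metadata", "Latitude", "FT", 769),
    MetadataField("Host metadata", "Host name", "FT", 2029),
    MetadataField("Organism metadata", "Gram staining", "CV", 4229),
]

for field in sorted(fields, key=lambda item: item.coverage(), reverse=True):
    print(f"{field.name:<18} {field.field_type:<3} {field.coverage():6.1%}")
```

Sorting by coverage makes it easy to see which fields, such as Latitude, remain to be curated for most projects.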

NEW DEVELOPMENTS

New user interface implementing new technologies

The burgeoning array of new types of data recorded in GOLD necessitated a major revamping of the graphical user interface. The GOLD tables have been visually enhanced using advanced graphical technologies such as the EXT JS JavaScript library for the grids, the Yahoo User Interface Library for the pie charts and data tables, the Google Maps API for geographical location display, Google MarkerClusterer for improved visual display of multiple map locations, and the JavaScript Object Notation (JSON) data format for rapid data loading.

On the main page (http://www.genomesonline.org/gold.cgi), three links have been added to connect to new pages displaying the current distribution of projects by type, sequencing status and phylogeny (Figure 2, right). On each of these new pages, the same technologies are used to convey key breakdown data in a visually intuitive manner. Below the links, the Google map is displayed showing all projects individually or in clusters (Figure 2, left). Clicking on a project displays information about the collection location, an image (if available), and a link to the project's GOLD CARD page.

Figure 2. Graphical displays in GOLD. (Left) Geographical display of the collection location for organisms and environmental samples. Click on a project to view the detailed information window showing the name of the project, an image (if available), a GOLD CARD link, and a short description identifying the location. (Right) Phylogenetic distribution of archaeal, bacterial and eukaryotic projects with accompanying data tables.

The same entry page provides access to the enhanced tables for the five major GOLD project categories (published complete genomes, archaeal ongoing genomes, bacterial ongoing genomes, eukaryotic ongoing genomes and metagenomes). Each table displays information for 12 primary metadata fields for each project. By default, projects are sorted by GOLDSTAMP ID, a number assigned sequentially as projects are entered in GOLD. To sort by the data in any other column, click the column header. To display advanced options, mouse over the column header and click to open the dropdown list. These options enable you to sort in ascending or descending order, to show/hide different columns and to filter the projects displayed based on data in that column.

The Search GOLD page has been completely rewritten. There are currently four tab pages, each corresponding to a different search mode and each offering new capabilities for more effective searching. The first tab, the basic search, provides commonly used Boolean queries for the most frequently searched fields in three main data categories. The Advanced Search tab offers a more extended list of search criteria from eight major data categories. The Metadata Search tab can be used to query the database metadata and view the results in tables and graphical displays of statistics and rankings. A fourth tab that is currently under development, Custom (SQL) Search, will enable users to construct and execute their own SQL queries. The aforementioned interface technologies are also employed here to provide an enhanced visual display of the search results and enable further manipulations. The user can export the search results to a Microsoft® Excel file or redirect them to the metadata analysis page. At that page, charts and statistics can be derived from the breakdown of the search results based on more than 40 metadata fields.

Finally, the GOLD CARD page has also been extensively redesigned, making for more intuitive navigation (Figure 3). Genome project data are now organized into seven major categories for easier access. Google map location and images of the organism(s) are provided when available. Empty data rows can be hidden by clicking the arrow located at the upper right corner of the card. The GOLD CARD page complies with the GSC standards (5) and provides IDs and links for all the compliant data fields. The list of metadata fields provided by GOLD, now more than 100, includes those currently part of the MIGS specifications plus many more that are now candidates for inclusion in the MIGS list.

The prefix in the GOLDSTAMP identifier assigned to each project encodes additional project information: Gc, GOLD complete; Gi, GOLD incomplete; Gm, GOLD metagenome; Ge, GOLD EST; Gr, GOLD resequencing; and Gt, GOLD transcriptome.
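The prefix scheme lends itself to a simple lookup. The following Python sketch decodes a GOLDSTAMP identifier into its project class and sequence number. The function name is hypothetical, and the identifier layout (a two-letter prefix followed directly by digits, as in "Gc00001") is an assumption made for illustration; only the six prefixes and their meanings come from the text.

```python
import re

# Prefix meanings as listed in the text.
GOLDSTAMP_PREFIXES = {
    "Gc": "GOLD complete",
    "Gi": "GOLD incomplete",
    "Gm": "GOLD metagenome",
    "Ge": "GOLD EST",
    "Gr": "GOLD resequencing",
    "Gt": "GOLD transcriptome",
}

# Assumed layout: two-letter prefix followed by a run of digits, e.g. "Gc00001".
_GOLDSTAMP_RE = re.compile(r"^(G[a-z])(\d+)$")


def decode_goldstamp(goldstamp):
    """Return (project_class, number) for an identifier such as 'Gc00001'."""
    match = _GOLDSTAMP_RE.match(goldstamp.strip())
    if match is None:
        raise ValueError(f"not a recognizable GOLDSTAMP ID: {goldstamp!r}")
    prefix, digits = match.groups()
    if prefix not in GOLDSTAMP_PREFIXES:
        raise ValueError(f"unknown GOLDSTAMP prefix: {prefix!r}")
    return GOLDSTAMP_PREFIXES[prefix], int(digits)


print(decode_goldstamp("Gc00001"))
print(decode_goldstamp("Gm00042"))
```

Unknown prefixes raise an error rather than being guessed, so new project types added to GOLD would need an explicit entry in the mapping.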

Figure 3. The GOLD CARD page, with the list of available metadata organized into six major categories. The corresponding MIGS/MIMS IDs are also shown for each GOLD field.

Metadata collection and management system

The number of genome projects initiated is increasing exponentially, and the effort required to curate the GOLD data is growing with it. To help cope with this flood, a new project management system (IMG-GOLD) was created to interface between GOLD and the Integrated Microbial Genomes (IMG) system (14). IMG is a widely used community resource for comparative analysis of publicly available genome data. The Expert Review version of IMG (IMG ER) (16) allows users to enter their own genome sequence data sets so that they can review and curate the annotations prior to their public release. Metadata accompanying those genome data sets are now captured via the IMG ER submission site (http://img.jgi.doe.gov/submit) and recorded in the new IMG-GOLD system (http://img.jgi.doe.gov/gold). IMG-GOLD now serves not only as the database underlying GOLD, but also as the source of metadata for IMG and IMG ER (and their metagenome counterparts, IMG/M and IMG/M ER). An example of how the metadata from GOLD can support and be presented through a metagenome analysis system, such as IMG/M (13), is presented in the Supplementary Data. We anticipate that similar data exchange and interoperability between GOLD and other analytical systems, such as RAST (17) and CAMERA (18), will be developed in the near future.

Other systems already powered by GOLD include the NIH-funded Human Microbiome Project Catalog (HMPC) provided through the Data Analysis and Coordination Center (DACC) (http://www.hmpdacc.org/). DACC connects directly to the GOLD database and accesses the HMP-specific data subset. To enable monitoring of the status of an HMP genome project, a new set of attributes and data types was added to GOLD and the existing controlled vocabularies were expanded. The HMPC page enables the DACC collaborators to choose and view genome strains targeted for sequencing. The community can also use this page to query the reference genomes and retrieve extensive metadata.

IMG-GOLD also provides a web-based data entry mechanism that enables genome project submitters and curators to create/update/delete GOLD genome projects, provide associated metadata and create/edit controlled vocabularies for new metadata attributes. For users who prefer to provide metadata in file format, preformatted Excel spreadsheets are provided on the GOLD site (http://genomesonline.org/Project_submission.htm) for both genome and metagenome projects.

Data availability

All GOLD data are available under the Creative Commons Attribution-NonCommercial-ShareAlike license (http://creativecommons.org/licenses/by-nc-sa/2.5/). All of the available metadata types in GOLD can be downloaded to an Excel file to facilitate wider distribution and use of the data.

OVERVIEW STATISTICS

Several different types of statistics, related to each of the data fields, can be derived using GOLD's advanced search engine, the new metadata search capability, and the data download capability. In addition, graphical overviews for specific data types are provided via the ‘Gold Statistics’ link on the database home page (http://genomesonline.org/gold_statistics.htm). This feature is supported for the data fields discussed in the following paragraphs.

Evolution of genome projects

The number of genome projects tracked in GOLD has increased steadily, with an average 2.25-fold increase every 2 years over the past 12 years (Figure 1A). Microbial genome projects account for most of that increase. This systematic and comprehensive genome project tracking can help address two major questions: (i) where are the remaining gaps in sequencing along the bacterial and archaeal branches of the tree of life, and how numerous are they, and (ii) how accurately can we predict the number of genome projects that will be sequenced over the next 3–5 years?
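
As a rough illustration of the second question, the quoted growth rate can be converted into an annual growth factor, a doubling time and a naive projection. This is only a sketch: it assumes that the 2.25-fold-per-2-years trend simply continues, it takes the figure of about 5800 projects reported for September 2009 as its starting point, and the function names are illustrative rather than part of GOLD.

```python
import math

FOLD_PER_PERIOD = 2.25   # average fold increase quoted in the text
PERIOD_YEARS = 2.0       # length of the period over which that increase occurs
PROJECTS_2009 = 5800     # approximate number of projects in GOLD, September 2009


def annual_growth_factor(fold=FOLD_PER_PERIOD, period=PERIOD_YEARS):
    """Per-year multiplicative factor equivalent to `fold` every `period` years."""
    return fold ** (1.0 / period)


def doubling_time_years(fold=FOLD_PER_PERIOD, period=PERIOD_YEARS):
    """Years needed for the project count to double at the quoted rate."""
    return period * math.log(2.0) / math.log(fold)


def projected_projects(years_ahead, start=PROJECTS_2009,
                       fold=FOLD_PER_PERIOD, period=PERIOD_YEARS):
    """Naive extrapolation: assumes the historical trend continues unchanged."""
    return start * fold ** (years_ahead / period)


if __name__ == "__main__":
    print(f"annual growth factor: {annual_growth_factor():.2f}")
    print(f"doubling time: {doubling_time_years():.2f} years")
    for years in (3, 5):
        print(f"projects after {years} years: {projected_projects(years):,.0f}")
```

At this rate the project count grows by about 50% per year and doubles roughly every 1.7 years. Such an extrapolation ignores changes in sequencing technology and funding, so it is best read as an order-of-magnitude guide.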

Table 4 addresses the first question by reporting the taxonomic distribution of genome projects, showing for each taxon the number of genome projects compared with the total number of described taxonomic units (filtering out the environmental and the unknown entries). In effect, it identifies the taxonomic groups in each domain of life for which there are no currently registered genome projects. These taxonomic groups should eventually become targets for new sequencing projects. Further, we hope that the availability of this systematic project monitoring will not only help identify the next sequencing targets, but also help the sequencing centers to avoid unnecessary redundancy and duplication of efforts.

Table 4.

Taxonomic distribution of genome projects

| Domain | Phyla | Class | Order | Family | Genus |
| --- | --- | --- | --- | --- | --- |
| Archaea: 179 | 5/5 | 9/9 | 24/26 | 24/26 | 85/109 |
| Bacteria: 4184 | 27/29 | 45/47 | 234/281 | 234/281 | 730/1930 |
| Eukarya: 1280 | 29/55 | 80/188 | 350/6288 | 350/6288 | 536/47906 |

For each taxon, the number with genome projects (first value) compared to the total number of identified taxons according to NCBI's Taxonomy.
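
The gap analysis that Table 4 supports can be sketched in a few lines. The values below are transcribed from Table 4, as read here, in the form (taxa with genome projects, total described taxa). The dictionary layout and the function name are illustrative assumptions and are not part of GOLD.

```python
# (taxa with at least one genome project, total described taxa), from Table 4
TABLE4 = {
    "Archaea": {"Phyla": (5, 5), "Class": (9, 9), "Order": (24, 26),
                "Family": (24, 26), "Genus": (85, 109)},
    "Bacteria": {"Phyla": (27, 29), "Class": (45, 47), "Order": (234, 281),
                 "Family": (234, 281), "Genus": (730, 1930)},
    "Eukarya": {"Phyla": (29, 55), "Class": (80, 188), "Order": (350, 6288),
                "Family": (350, 6288), "Genus": (536, 47906)},
}


def sequencing_gaps(table):
    """Return {domain: {rank: (unsampled_taxa, percent_covered)}}."""
    gaps = {}
    for domain, ranks in table.items():
        gaps[domain] = {}
        for rank, (with_projects, total) in ranks.items():
            unsampled = total - with_projects
            percent = 100.0 * with_projects / total
            gaps[domain][rank] = (unsampled, percent)
    return gaps


if __name__ == "__main__":
    for domain, ranks in sequencing_gaps(TABLE4).items():
        for rank, (unsampled, percent) in ranks.items():
            print(f"{domain:8s} {rank:6s} {unsampled:6d} unsampled, "
                  f"{percent:5.1f}% covered")
```

With these values the gaps widen sharply at lower taxonomic ranks: for example, 1200 of the 1930 described bacterial genera have no registered genome project.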

Table 5 attempts to address the second question: what is the anticipated growth of microbial genome projects over the next 5 years? Under a very conservative estimate, we would expect a 3-fold increase in the number of complete, and a 10-fold increase in the number of draft, microbial genome projects relative to those sequenced during the last 15 years. However, if we extrapolate a linear increase in the number of finished and draft genomes based on Figure 1A, those predictions would be realized within the next 3 years.

Table 5.

Predicted increase of microbial genome sequencing projects

| | 1995–2009 | 2010–2015 |
| --- | --- | --- |
| Finished | 1000 | 3000 |
| Draft | 1100 | 11 000 |
| Genes | 7.5 million genes | 56 million genes |
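
The fold increases implied by Table 5 can be checked directly. The following is a minimal sketch that uses the rounded values from the table; the variable and function names are illustrative.

```python
# Rounded values from Table 5: (1995-2009, 2010-2015)
TABLE5 = {
    "Finished": (1_000, 3_000),
    "Draft": (1_100, 11_000),
    "Genes": (7_500_000, 56_000_000),
}


def fold_increases(table):
    """Ratio of the predicted 2010-2015 value to the 1995-2009 value."""
    return {row: later / earlier for row, (earlier, later) in table.items()}


if __name__ == "__main__":
    for row, fold in fold_increases(TABLE5).items():
        print(f"{row}: {fold:.1f}-fold")
```

The finished and draft rows reproduce the 3-fold and 10-fold figures given in the text, and the predicted gene count rises about 7.5-fold.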

Sequencing centers

Four major sequencing centers account for about 50% of the 5843 sequencing projects currently monitored in GOLD (Figure 1B), a situation that has not changed over the last 2 years. However, when considering only archaeal and bacterial projects, the two leading sequencing centers (JGI and JCVI) now represent a smaller share: about 35%, compared to more than half 2 years ago. The fact that a much larger community is now carrying out these projects compared to 2 years ago also reflects the increasing democratization of the sequencing technology.

Phylogenetic distribution

The sampling bias favoring three major bacterial lineages—Proteobacteria, Firmicutes and Actinobacteria—has decreased only slightly during the last couple of years (Figure 1D). The above three lineages now comprise 80% of all genome projects compared to 82% 2 years ago. This small shift is due mostly to large-scale sequencing efforts, such as the GEBA and HMP, which target previously neglected phylogenetic lineages. Clearly, there remains much room for improvement here, and further progress can be expected if similar large-scale biodiversity sequencing efforts continue.

FUTURE DIRECTIONS

The challenges facing GOLD have increased dramatically as GOLD continues to evolve from a genome/metagenome project monitoring system into a universal genome project core catalog/indexer charged with the task of providing data interconnectivity, exchange and dissemination. In this new role, GOLD is required to efficiently store, process and automatically track metadata that is rapidly increasing in scope and complexity. All the while, there is a great expectation for GOLD to pioneer future genomic standards.

Meeting these challenges will require the creation of a shared genome project conceptual model and a database schema to handle the genome-project-associated metadata. The genome/metagenome data continue to be somewhat structured and hierarchical, but the rich associated metadata now becoming available require the creation of a ‘Genome Project Ontology’ for effective management. Incorporation of other available ontologies, such as existing medical and environmental ontologies, is part of the immediate plan.

Furthermore, numerous other bioinformatics databases and researchers will need to acquire and/or synchronize with GOLD data. To address their needs, GOLD will provide access for client programs via web services using SOAP, GOLDXML and RESTful technologies, and will communicate with subscribers via RSS feeds. To further increase community access to GOLD, a GOLD-wiki site will be established where genome project curators can contribute additional project information using various media-rich data formats. We also plan to employ data warehousing tools to facilitate reporting and analysis of the GOLD data on the statistics page, thereby eliminating the need for the manual creation of Excel charts that quickly become outdated. To improve data mining, the GOLD search engine will provide an advanced query mechanism in which the available search criteria depend on the meta-properties of the input objects.

DATABASE AVAILABILITY

GOLD can be accessed at: http://www.genomesonline.org/.

Further comments and feedback are welcome at: [email protected].

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

ACKNOWLEDGEMENTS

We would like to thank Merry Youle for her excellent editorial assistance. GOLD has been maintained and developed mostly based on the volunteer work of its small team. We are grateful to all the colleagues who kindly provide information for the more accurate monitoring of the genome projects and particularly to Michelle Giglio and Heather Huot from University of Maryland. The support of Rashida Lathan, Stella Proukaki and Tatiana Drakakis is especially acknowledged. The full list of all contributors is available at: (http://www.genomesonline.org/acknowledgments.html).

FUNDING

The US Department of Energy's Office of Science, Biological and Environmental Research Program; and by the University of California, Lawrence Berkeley National Laboratory under Contract No. DE-AC02-05CH11231, Lawrence Livermore National Laboratory under Contract No. DE-AC52-07NA27344; and Los Alamos National Laboratory under Contract No. DE-AC02-06NA25396. Funding for open access charge: Department of Energy.

Conflict of interest statement. None declared.

REFERENCES

1. Kyrpides N. Genomes OnLine Database (GOLD 1.0): a monitor of complete and ongoing genome projects world-wide. Bioinformatics (1999) 15:773–774.

2. Bernal A, Ear U, Kyrpides N. Genomes Online Database (GOLD): a monitor of genome projects world-wide. Nucleic Acids Res. (2001) 29:126–127.

3. Liolios K, Tavernarakis N, Hugenholtz P, Kyrpides NC. The Genomes On Line Database (GOLD) v.2: a monitor of genome projects world-wide. Nucleic Acids Res. (2006) 34:D332–D334.

4. Liolios K, Mavromatis K, Tavernarakis N, Kyrpides NC. The Genomes OnLine Database (GOLD) in 2007: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res. (2007) 36:D475–D479.

5. Field D, Garrity G, Gray T, Morrison N, Selengut J, Sterk P, Tatusova T, Thompson N, Allen MJ, Ashburner M, et al. Towards a richer description of our complete collection of genomes and metagenomes: the “Minimum Information about a Genome Sequence” (MIGS) specification. Nat. Biotechnol. (2008) 26:541–547.

6. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank. Nucleic Acids Res. (2007) 35:D21–D25.

7. Kulikova T, Akhtar R, Aldebert P, Althorpe N, Andersson M, Baldwin A, Bates K, Bhattacharyya S, Bower L, Browne P, et al. EMBL Nucleotide Sequence Database in 2006. Nucleic Acids Res. (2007) 35:D16–D20.

8. Okubo K, Sugawara H, Gojobori T, Tateno Y. DDBJ in preparation for overview of research activities behind data submissions. Nucleic Acids Res. (2006) 34:D6–D9.

9. Birney E, Hudson TJ, Green ED, Gunter C, Eddy S, Rogers J, Harris JR, Ehrlich SD, Apweiler R, Austin CP, et al. Prepublication data sharing. Nature (2009) 461:168–170.

10. Kyrpides NC. Fifteen years of microbial genomics: meeting the challenges and fulfilling the dream. Nat. Biotechnol. (2009) 27:627–632.

11. Garrity GM, Field D, Kyrpides NC. Standards in genomic sciences. Stand. Genomic Sci. (2009) 1:1–2.

12. Dawyndt P, Vancanneyt M, DeMeyer H, Swings J. Knowledge accumulation and resolution of data inconsistencies during the integration of microbial information sources. IEEE Transactions on Knowledge and Data Engineering (2005) 17:1111–1126.

13. Markowitz VM, Ivanova NN, Szeto E, Palaniappan K, Chu K, Dalevi D, Chen I-MA, Grechkin Y, Dubchak I, Anderson I, et al. IMG/M: a data management and analysis system for metagenomes. Nucleic Acids Res. (2008) 36:D534–D538.

14. Markowitz VM, Szeto E, Palaniappan K, Grechkin Y, Chu K, Chen I-MA, Dubchak I, Anderson I, Lykidis A, Mavromatis K, et al. The Integrated Microbial Genomes (IMG) system in 2007: data content and analysis tool extensions. Nucleic Acids Res. (2008) 36:D528–D533.

15. Hirschman L, Clark C, Cohen KB, Mardis S, Luciano J, Kottmann R, Cole J, Markowitz V, Kyrpides N, Morrison N, et al. Habitat-Lite: a GSC case study based on free text terms for environmental metadata. OMICS (2008) 12:129–136.

16. Markowitz VM, Mavromatis K, Ivanova NN, Chen I-MA, Chu K, Kyrpides NC. Expert IMG ER: a system for microbial genome annotation expert review and curation. Bioinformatics (2009) 25:2271–2278.

17. Meyer F, Paarmann D, D'Souza M, Olson R, Glass EM, Kubal M, Paczian T, Rodriguez A, Stevens R, Wilke A, et al. The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics (2008) 19:386.

18. Seshadri R, Kravitz SA, Smarr L, Gilna P, Frazier M. CAMERA: a community resource for metagenomics. PLoS Biol. (2007) 5:e75.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
