DDBJ with new system and face

Abstract

DDBJ ( http://www.ddbj.nig.ac.jp ) collected and released 1 880 115 entries or 1 134 086 245 bases in the period from July 2006 to June 2007. The released data contains the high-throughput cDNAs of cricket and high-quality draft genome of medaka among others. Our computer system has been upgraded since March 2007. Another new aspect is an efficient data retrieval tool that has recently been equipped and served at DDBJ. It is called All-round Retrieval for Sequence and Annotation, which enables the user to search for keywords also in the Feature/Qualifier of the International Nucleotide Sequence Database Collaboration ( http://www.insdc.org/ ). We will also replace our home page with a more efficient one by the end of 2007.

INTRODUCTION

Through our service we have witnessed dramatic advancements in biology and the related areas in the past 20 or more years. For example, using genome sequence data for eubacteria, archaebacteria and eukaryotes, some authors constructed a tree of life, which is the phylogenetic tree of the three super-kingdoms ( 1 , 2 ). Others reported a way to predict the number of genes at least in the bacterial world ( 3 ). The dramatic advancements prove our simple idea that the more data we collect and serve the more people make use of it for various purposes.

On the other hand, the recent development of sequencing machines such as 454 (by 454 Life Sciences), Solexa (by Illumina, Inc.) and SOLiD (by Applied Biosystems) makes us worrisome as well. According to some estimate, 5–10 tera bases will be sequenced by Solexa at one sequencing facility in a month in the near future. With the further development of the sequencing technology the whole genome of a person may repeatedly be submitted in the near future, as few examples warn ( 4 ). To cope with the expected situation of sequencing genes and genomes, we have recently upgraded our computer system and installed an efficient keyword search tool. We think that the new computer and tool serve our data submitters and users better and make our job more effective and efficient.

In this article we will report on the data submissions to DDBJ in the past year, replacement of our computer system with an upgraded one, a new data retrieval tool and a new home page.

DATA SUBMISSIONS TO DDBJ IN THE LAST YEAR

In the period from July 2006 to June 2007, DDBJ collected and released the original data of 1 880 115 entries or 1 134 086 245 bases that were classified into the 19 International Nucleotide Sequence Database Collaboration (INSDC) divisions ( 5 ). More than 90% of the submissions came from Japanese researchers, and the rest were mainly from Chinese and Korean researchers.

The released data includes the high-throughput cDNAs (HTC) of cricket, Gryllus bimaculatus submitted from Tokushima University ( 6 ). The data amount is 32 010 entries that can be obtained through anonymous FTP with the file name, Gryllus_bimaculatus_HTC_070726_1.seq.gz. Also included is 700 Mb of the high-quality draft genome data of medaka, Oryzias latipes , which was submitted from University of Tokyo and National Institute of Genetics ( 7 ). The data was carefully assembled and upgraded from the WGS data that was reported in our previous paper ( 8 ). The given accession numbers are BAAF03000000 (Hd-rR, version 0.9), BAAF04000000 (Hd-rR, version 1.0) BAAE01000000 (HNI) and ACAAA0000001-ACAAA0356693 (5′ SAGE tags). Although draft genome sequences of two fugu (blowfish) species are available, the high-quality draft genome of medaka will be quite useful particularly for the study of vertebrate evolution. The submitters of the genome data discussed, for example, that the medaka genome preserved its ancestral karyotype for more than 300 million years ( 7 ).

It is also noted that the current number of bacterial species/strains in the complete bacteria genome data repository, the Genome Information Broker, (GIB, http://gib.genes.nig.ac.jp/ ) ( 9 ), at DDBJ is 569 and keeps on growing rapidly. The species added in the past year include Methanococcus maripaludis (by Joint Genome Institute), Saccharopolyspora erythraea (by University of Cambridge), Francisella tularensis subsp. tularensis (by UT Southwestern Medical Center), Desulfotomaculum reducens (by Joint Genome Institute), Burkholderia vietnamiensis (by Joint Genome Institute), Herminiimonas arsenicoxydans (by Genoscope), Geobacillus thermodenitrificans (by Nankai University), Corynebacterium glutamicum (by RITE) and many others. We also serve a complete virus genome data repository, GIB for Viruses (GIB-V, http://gib-v.genes.nig.ac.jp/ ) that now contains 31 486 virus genomes and genomic segments.

NEW COMPUTER SYSTEM

In July 2007, we celebrated the 20th anniversary of the public release of the DNA data. Our first release in July 1987 contained only 66 entries or 108 970 bases that were typed in from published papers. These numbers may be impressive in the comparison with the corresponding ones as of June 2007, 13 371 690 entries or 8 988 178 758 bases. This tremendous increase in the numbers perhaps reflects the remarkable advancement of research in biology and the related areas in Japan in the past 20 years. The ever-increasing amount of the data also makes us worry about our hardware and software facilities.

In March 2007, we completely replaced our computer system with an upgraded one. Major upgraded aspects are as follows. (i) The increase in the number of entries in making the flat files from 300 000 to 1 000 000 entries/day, (ii) the decrease in processing time in making a huge flat file; in case of four rice chromosomes, from 110 to 13 min, (iii) the decrease in processing time from 120 to 13 min for updating the live-list that lists the accession numbers and dates of the public release of the released entries; it is weekly updated to exchange the information about the currently released data with the EMBL Bank and GenBank, (iv) the increase in the number of ESTs in data processing from 40 000 to 800 000 entries/h and (v) the increase in the number of queries accepted at once by 1.5 times. Therefore, we will be able to cope with the increase in the number of data submissions for the next several years.

NEW KEYWORD SEARCH TOOL

Recently, we have installed a high-speed keyword search tool, All-round Retrieval for Sequence and Annotation (ARSA, http://arsa.ddbj.nig.ac.jp/top-e.html ). The search logic behind ARSA is called SIGMA, which was invented by Arikawa and his colleagues ( 10 , 11 ). For a given query SIGMA makes it possible to retrieves all the right entries by checking the contents of a database just once, no matter how the query is complicated. The one time checking makes keyword search fast. SIGMA does not need an index file, which means that search can be made against the currently available data. SIGMA is implemented on the Shunsaku search engine developed by Fujitsu. The search engine operates in parallel for divided data, which makes the search even faster. ARSA also has a large scalability with an increasing amount of data. In theory, one search can be completed within 10 s irrespective of the data size and the query formula. If the data increases more than 10 times larger than the current amount, however, we may have to increase the number units in the Shunsaku accordingly to keep the present search speed.

ARSA covers 23 databases including DDBJ, UniProt, PFAM, PDB and LENZYME. A special feature of ARSA is that it can also incorporate the terms defined by the Feature/Qualifier of INSDC. While this feature is very helpful for us to annotate the submitted data, it enables our user to perform data retrieval by using terms in the Feature/Qualifier. For example, you can search for CDSs (protein coding sequences) located on human Y chromosome, as shown in Figure 1 . In the figure, the query formula is given on the top, and a part of the hit entries is given below with the accession numbers. By clicking one of the numbers you can see its contents. HUM in the last column stands for the human division. You can download the search result in Flat File, FASTA or XML, and also choose the items in the search results to be displayed on the computer screen and directly download them in tab-limited format. We also provide you with WebAPI ( http://xml.nig.ac.jp/ > http://xml.nig.ac.jp/ ) ( 12 ) so that you can customize ARSA by writing a program in Perl or JAVA. We will soon include KEGG ( http://www.genome.ad.jp/kegg/ ) in ARSA and make the 24 databases simultaneously retrievable for common keywords.

Figure 1.

Example of keyword search by ARSA. Keywords used are ‘source’ (Feature), ‘chromosome’ (Qualifier belonging to ‘source’) and CDS (Feature). ‘Chromosome’ has a value attribute to which ‘Y’ is given for specifying chromosome Y.

Open in new tab Download slide

NEW FACE OF DDBJ

We updated our home page (HP) in 2005 ( 13 ). We are again in the process of updating it rather drastically this time. Since the present HP holds many contents that have been added in an irregular sequence without much consideration for consistency, it is not really convenient now for our data submitters and users. The main point of the updating thus is to reach the almost every content with three clicks or less, which is now a common practice in making use of a HP. In the new HP when you click one of our main services, data submission, data retrieval, ftp/SOAP, statistics and inquiry, you can get the whole view of all contents for each service at once, and easily go to the one of them that you wish. The new HP will replace the present one by the end of 2007. We hope the new HP on the new computer system and tool will be more attractive to our data submitters and users worldwide.

ACKNOWLEDGEMENTS

We thank all DDBJ staff for the collection, annotation and distribution of the original data. We are also grateful to H. Yamada and T. Hosokawa for providing us with valuable information on ARSA. DDBJ is funded by Ministry of Education, Culture, Sports, Science and Technology (MEXT) with Management experience grants for national university cooperation. Funding to pay the Open Access publication charges for this article was provided by Japan Society for Promotion of Science Grant no. 16255006.

Conflict of interest statement . None declared.

REFERENCES

Wolf

Rozogin

Grishin

Koonin

Genome trees and the tree of life

Trends Genet.

2002

, vol.

(pg.

472

479

)

Fukami-Kobayashi

Minezaki

Tateno

Nishikawa

A tree of life based on protein domain organizations

Mol. Biol. Evol.

2007

, vol.

(pg.

1181

1189

)

Kosuge

Abe

Okido

Tanaka

Hirahata

Maruyama

Mashima

Tomiki

Kurokawa

et al. ,

Exploration and grading of possible genes in 183 bacterial strains by a common fine protocol lead to new genes: Gene Trek in Prokaryote Space (GTPS)

DNA Res.

2006

, vol.

(pg.

245

254

)

Levy

Sutton

Feuk

Halpern

Walenz

Axelrod

Huang

Kirkness

et al. ,

The diploid genome sequence of an individual human

PLoS Biol.

2007

, vol.

pg.

e254

Okubo

Sugawara

Gojobori

Tateno

DDBJ in preparation for overview of research activities behind data submissions

Nucleic Acids Res.

2006

, vol.

(pg.

)

Shinmyo

Mito

Uda

Nakamura

Miyawaki

Ohuchi

Noji

Brachyenteron is necessary for morphogenesis of the posterior gut but not for anteroposterior axial elongation from the posterior growth zone in the intermediate-germband cricket Gryllus bimaculatus

Development

2006

, vol.

133

(pg.

4539

4547

)

Kasahara

Naruse

Sasaki

Nakatani

Ahsan

Yamada

Nagayasu

Doi

et al. ,

The medaka draft genome and insights into vertebrate genome evolution

Nature

2007

, vol.

447

(pg.

714

719

)

Sugawara

Abe

Gojobori

Tateno

DDBJ working on evaluation and classification of bacterial genes in INSDC

Nucleic Acids Res.

2007

, vol.

(pg.

D13

D15

)

Fumoto

Miyazaki

Sugawara

Genome Information Broker (GIB): data retrieval and comparative analysis system for completed microbial genomes and more

Nucleic Acids Res.

2002

, vol.

(pg.

)

Arikawa

One-way sequential search systems and their powers

Bull. Math. Stat.

1981

, vol.

(pg.

)

Google Scholar

OpenURL Placeholder Text

WorldCat

Arikawa

Shinohara

Takeya

SIGMA: A text database management system

Berliners Informatik Tag

1989

(pg.

)

Google Scholar

OpenURL Placeholder Text

WorldCat

Sugawara

Miyazaki

Biological SOAP servers and web services provided by the public sequence data bank

Nucleic Acids Res.

2003

, vol.

(pg.

3836

3839

)

Tateno

Saitou

Okubo

Sugawara

Gojobori

DDBJ in collaboration with mass-sequencing teams on annotation

Nucleic Acids Res.

2005

, vol.

(pg.

D25

D28

)

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-com

Download all slides

Month:	Total Views:
February 2017	3
March 2017	1
April 2017	2
July 2017	7
August 2017	5
September 2017	5
October 2017	5
November 2017	7
December 2017	14
January 2018	13
February 2018	10
March 2018	12
April 2018	13
May 2018	20
June 2018	8
July 2018	19
August 2018	9
September 2018	9
October 2018	11
November 2018	17
December 2018	9
January 2019	7
February 2019	7
March 2019	45
April 2019	20
May 2019	13
June 2019	13
July 2019	8
August 2019	18
September 2019	16
October 2019	6
November 2019	9
December 2019	12
January 2020	8
February 2020	17
March 2020	8
April 2020	6
May 2020	11
June 2020	14
July 2020	10
August 2020	5
September 2020	12
October 2020	5
November 2020	10
December 2020	11
January 2021	1
February 2021	17
March 2021	29
April 2021	12
May 2021	4
June 2021	8
July 2021	12
August 2021	5
September 2021	4
October 2021	14
November 2021	14
December 2021	1
January 2022	7
February 2022	8
March 2022	5
April 2022	5
May 2022	10
June 2022	8
July 2022	7
August 2022	17
September 2022	9
October 2022	11
November 2022	7
December 2022	9
January 2023	6
February 2023	7
March 2023	3
April 2023	5
May 2023	11
June 2023	1
July 2023	3
August 2023	6
September 2023	7
October 2023	8
November 2023	7
December 2023	12
January 2024	9
February 2024	8
March 2024	27
April 2024	11
May 2024	41
June 2024	17
July 2024	20
August 2024	7
September 2024	7
October 2024	23
November 2024	12
December 2024	3
January 2025	6
February 2025	2
March 2025	6
April 2025	5
May 2025	8

Article Contents

DDBJ with new system and face

Abstract

INTRODUCTION

DATA SUBMISSIONS TO DDBJ IN THE LAST YEAR

NEW COMPUTER SYSTEM

NEW KEYWORD SEARCH TOOL

NEW FACE OF DDBJ

ACKNOWLEDGEMENTS

REFERENCES

Comments

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Article Contents

DDBJ with new system and face Open Access

Abstract

INTRODUCTION

DATA SUBMISSIONS TO DDBJ IN THE LAST YEAR

NEW COMPUTER SYSTEM

NEW KEYWORD SEARCH TOOL

NEW FACE OF DDBJ

ACKNOWLEDGEMENTS

REFERENCES

Comments

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

This Feature Is Available To Subscribers Only

DDBJ with new system and face