UniGene Tabulator: a full parser for the UniGene format

Abstract

Summary: UniGene Tabulator 1.0 provides a solution for full parsing of the UniGene flat file format; it implements a structured graphical representation of each data field present in UniGene following import into a common database management system usable on a personal computer. This database includes related tables for sequence, protein similarity, sequence-tagged site (STS) and transcript map interval (TXMAP) data, plus a summary table where each record represents a UniGene cluster.

UniGene Tabulator enables full local management of UniGene data, allowing parsing, querying, indexing, retrieving, exporting and analysis of UniGene data in a relational database form, usable on computers running Macintosh (OS X 10.3.9 or later) or Windows (2000 with Service Pack 4, XP with Service Pack 2, or later) operating systems.

Availability: The current release, including both the Macintosh and Windows FileMaker runtime applications, is freely available at Author Webpage

Contact: [email protected]

Supplementary information: We also distribute a precalculated implementation for current Homo sapiens (build #190, March 2006) and Danio rerio (zebrafish, build #90, March 2006) UniGene data.

1 INTRODUCTION

UniGene is ‘an experimental system for automatically partitioning GenBank sequences into a non-redundant set of gene-oriented clusters’ (Wheeler et al., 2005). As a consequence, building UniGene has made it possible to assign a set of finished and expressed sequence tag (EST) (Boguski et al., 1993) RNA sequence entries to a unique genomic locus. This feature makes UniGene relevant in virtually every case where manipulation of gene-related data and sequences is necessary. In fact, UniGene data are commonly used in a wide range of databases and applications, including attribution of a spotted cDNA/oligonucleotide to a transcribed locus in microarray data analysis, integration of database and cross-referencing systems, gene function prediction (owing to the availability of precomputed protein similarities for each cluster), systematic analysis of tissue-specific expression and map position, and gene discovery (Yuan et al., 2001).

There are two alternative ways to access UniGene data: querying the ‘Entrez’ website with UniGene selected as the database (Author Webpage), or downloading UniGene flat files from its FTP repository site (Author Webpage). The first method gives the user only the information that is searchable via the website query form, and complex queries are possible only within the limits of the ‘Entrez’ query language. The retrieved information can then be downloaded. The FTP download site allows users to save to their own computer a flat file containing the data for all clusters of the desired species.

In both cases the user obtains a plain text file, which contains data in a standard text format that is nevertheless not intended to be structured in a relational way. Converting data available in a flat file format into the appropriate record fields of a relational database requires a method for parsing the information. A user with advanced information technology skills could use a programming or scripting language (BioPerl, C++, Java and so on) to this end. To date, an increasing number of papers have described software based on this kind of language and designed for full parsing of existing biomedical data repositories, e.g. Entrez Gene (Liu and Grigoriev, 2005) and MedLine (Oliver et al., 2004). However, although links to the UniGene website or internal cross-referencing to UniGene data may be provided, no published implementation provides a complete parser to convert UniGene flat files into a fully relational database.

Here we describe the first ad hoc system for parsing, querying, retrieving, exporting and analysing UniGene data in a relational database form, usable on computers running Macintosh (OS X 10.3.9 or later) or Windows (2000 with Service Pack 4, XP with Service Pack 2, or later) operating systems.

2 RESULTS

To parse UniGene data, we have developed the software UniGene Tabulator. In this software, the application UniGene Tabulator (a FileMaker runtime) runs the original database named UniGene.UGT. The FileMaker Pro engine (FileMaker, Santa Clara, CA; see Stephens, 2002, Author Webpage) is a database management system commonly used as an office application and is very easy to use. Using its developer version, a database may be distributed as a free FileMaker runtime application for both Macintosh and Windows platforms. The runtime application allows the end user to perform a wide subset of FileMaker Pro operations without buying the full software. The database UniGene.UGT successfully imports data concerning each specific feature of UniGene records, including cDNA library information, into the appropriate data fields of a FileMaker Pro 8 relational database file.

UniGene Tabulator is freely available at the site Author Webpage. We also distribute a precalculated implementation for current Homo sapiens (build #190, March 2006) and Danio rerio (zebrafish, build #90, March 2006) UniGene data (Hs.190UGTabMac.sit or Hs.190UGTabWin.zip and Dr.90UGTabMac.sitx or Dr.90UGTabWin.zip).

Finally, precalculated text-format tables are provided in our distribution for the H.sapiens and D.rerio UniGene data, with updates scheduled.

2.1 Development of the software

First, we analyzed the detailed description of the UniGene flat file format (Author Webpage) in order to identify characters that consistently delimit each data type, and to convert the flat file format into a series of related tables, allowing the appropriate import of each data type.

Our strategy is based on directly importing the downloaded files into a specific database table, and on parsing the information through an automated process, i.e. a FileMaker Pro script. The table ‘SEQUENCE’ collects data from the ‘UniGene’ file (.data format) into the specific (hidden) layout ‘DATA’. The lines of UniGene files are delimited by a line feed (‘LF’) character, so each line is imported into a separate record. During the import step, each line is tagged, according to its starting characters, as one of: (1) sequence data line; (2) STS data line; (3) transcript map position data line; (4) protein similarity data line; (5) cluster general information data line. Sequence data are processed directly within the fields of this table, shown in the layout ‘SEQUENCE’, while the other information types are moved to their specific destination tables; during this process, appropriate scripts extract the information for each single UniGene feature (e.g. accession number of an STS, protein accession number of a similar protein, gene symbol, chromosome localization and so on). Data for each library are imported on request from ‘.lib.info’ files, joined into a single database record, parsed via text calculation fields and finally displayed via a relationship within each record in the layout ‘SEQUENCE’, which at the end of the process represents all the information available about each single nucleotide sequence listed in a UniGene cluster.
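The line-tagging step described above can be illustrated outside FileMaker. The following Python sketch is not the authors' FileMaker script; it assumes that each line begins with an upper-case keyword (here SEQUENCE, STS, TXMAP and PROTSIM, matching the table names used in this paper), that every other keyword carries general cluster information, and that a line starting with ‘//’ closes a cluster record. The function names and the sample values are hypothetical.

```python
# Hypothetical sketch of the tagging step; not the authors' FileMaker script.
# Assumption: the leading keyword of a line identifies its data type.
CATEGORY_BY_TAG = {
    "SEQUENCE": "sequence",  # (1) sequence data line
    "STS": "sts",            # (2) STS data line
    "TXMAP": "txmap",        # (3) transcript map position data line
    "PROTSIM": "protsim",    # (4) protein similarity data line
}


def tag_line(line):
    """Return (category, keyword, value) for one flat-file line."""
    keyword, _, value = line.rstrip("\n").partition(" ")
    # Any keyword not listed above is treated as (5) cluster general information.
    category = CATEGORY_BY_TAG.get(keyword, "cluster")
    return category, keyword, value.strip()


def split_records(text):
    """Group tagged lines by cluster, assuming '//' terminates each record."""
    records, current = [], []
    for line in text.split("\n"):  # lines are LF-delimited
        if not line.strip():
            continue
        if line.startswith("//"):
            if current:
                records.append(current)
            current = []
        else:
            current.append(tag_line(line))
    if current:
        records.append(current)
    return records


# Invented sample record, for illustration only.
SAMPLE = (
    "ID          Hs.0\n"
    "GENE        EXAMPLE\n"
    "STS         ACC=G00001 UNISTS=1\n"
    "SEQUENCE    ACC=X00001.1; SEQTYPE=mRNA\n"
    "//\n"
)

records = split_records(SAMPLE)
```

Each tagged tuple can then be routed to the table that matches its category, which mirrors the move of non-sequence lines to their destination tables.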

Information for each UniGene cluster about sequence-tagged sites (STS), gene map positions obtained from ‘Radiation hybrid’ experiments and the existence of known ortholog proteins is displayed, if available, in the ‘STS’ table, ‘TXMAP’ table and ‘PROTSIM’ table, respectively.
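Extracting individual features (such as the accession number of an STS or of a similar protein) from a tagged line can be sketched as follows. This Python example assumes that the value part of a line is a series of KEY=value pairs separated by semicolons or spaces, and that values contain no embedded spaces; the function name and the sample strings are hypothetical and are not taken from the software.

```python
import re

# Assumption: subfields look like KEY=value and are separated by ';' or spaces.
PAIR_PATTERN = re.compile(r"(\w+)=([^;\s]+)")


def extract_fields(value):
    """Split the value part of a data line into a {key: value} dictionary."""
    return dict(PAIR_PATTERN.findall(value))


# Invented values, for illustration only.
sts_fields = extract_fields("ACC=G00001 UNISTS=1")
seq_fields = extract_fields("ACC=X00001.1; NID=g1; SEQTYPE=mRNA")
```

A dictionary of this kind maps naturally onto the fields of one record in the corresponding table.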

Finally, all information common to each UniGene cluster, such as UniGene identifier, gene symbol or chromosomal location, is displayed in a single record in the summary table ‘UniGene’, so that each record corresponds to a single cluster.
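The resulting layout, one summary record per cluster with related records in the other tables, can be illustrated with a generic relational engine. The sketch below uses SQLite through Python's standard sqlite3 module in place of FileMaker Pro; the table names follow this paper, whereas the column names, the helper functions and the inserted values are assumptions made for the example.

```python
import sqlite3

# Hypothetical schema: one summary row per cluster, related rows elsewhere.
SCHEMA = """
CREATE TABLE unigene  (cluster_id TEXT PRIMARY KEY, gene_symbol TEXT, chromosome TEXT);
CREATE TABLE sequence (cluster_id TEXT REFERENCES unigene(cluster_id), accession TEXT);
CREATE TABLE sts      (cluster_id TEXT REFERENCES unigene(cluster_id), accession TEXT);
CREATE TABLE txmap    (cluster_id TEXT REFERENCES unigene(cluster_id), map_interval TEXT);
CREATE TABLE protsim  (cluster_id TEXT REFERENCES unigene(cluster_id), protein_accession TEXT);
"""


def build_demo_db():
    """Create an in-memory database holding one invented cluster."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO unigene VALUES (?, ?, ?)", ("Hs.0", "EXAMPLE", "1"))
    conn.executemany(
        "INSERT INTO sequence VALUES (?, ?)",
        [("Hs.0", "X00001"), ("Hs.0", "X00002")],
    )
    conn.execute("INSERT INTO sts VALUES (?, ?)", ("Hs.0", "G00001"))
    conn.commit()
    return conn


def sequences_per_cluster(conn):
    """Count the sequence records related to each cluster summary record."""
    return conn.execute(
        "SELECT u.cluster_id, u.gene_symbol, COUNT(s.accession) "
        "FROM unigene u LEFT JOIN sequence s ON s.cluster_id = u.cluster_id "
        "GROUP BY u.cluster_id, u.gene_symbol ORDER BY u.cluster_id"
    ).fetchall()


conn = build_demo_db()
summary = sequences_per_cluster(conn)
```

Because each sequence is stored in its own row, the number of sequences per cluster is not bounded by the schema.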

A detailed user guide including technical specifications is distributed with the software, along with an introductory tutorial. The runtime application allows full data import and export in several formats as well as complete record management and browsing, while the creation of new fields for further calculation or relationship definition within the database itself requires the installation of the FileMaker Pro 8 application, or higher.

3 DISCUSSION

We have described the first software to execute a full parsing of UniGene data, generating a relational, fully indexed database usable on computers running Macintosh and Windows operating systems. UniGene Tabulator cannot be used on computers running UNIX-like operating systems for which no appropriate FileMaker Pro implementation is available; however, it should run on any full Windows emulator. For example, we successfully used UniGene Tabulator for Windows on a Windows XP system running under the Virtual PC 7.0.1 emulator software on a Macintosh computer running OS X 10.3.9.

Following a different approach from ours, a National Human Genome Research Institute (NHGRI, Author Webpage) site has distributed a parsed version of human and mouse UniGene to be imported into an appropriate FileMaker Pro 4 database template (Author Webpage). However, this approach presents several limitations: first, the parser algorithm and software are not released; in addition, the database structure is not fully relational, in that it uses a ‘repeated field’ function, which does not fulfill the criteria for true database relationality and imposes a limit of 1000 GenBank accession numbers per cluster; finally, parsed importable datasets were provided only for human and mouse UniGene data, and periodic release was discontinued following human UniGene build #155 (September 2002).

UniGene Tabulator lets users parse updated UniGene flat files, which contain increasing numbers of UniGene clusters and GenBank sequence accession numbers. Our implementation can export data, easily searched by the user, in a general tab-delimited text format that can be used in any cross-platform data management system for any purpose.
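As an illustration of what a general tabulated text export looks like, the following Python sketch writes records as tab-delimited text with a header row, using only the standard csv module. It does not reproduce the export function of the runtime application; the function name, field names and values are hypothetical.

```python
import csv
import io


def export_tab_delimited(rows, fieldnames):
    """Serialize a list of dictionaries as tab-delimited text with a header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Invented records, for illustration only.
exported = export_tab_delimited(
    [
        {"cluster_id": "Hs.0", "gene_symbol": "EXAMPLE"},
        {"cluster_id": "Hs.00", "gene_symbol": "SAMPLE"},
    ],
    ["cluster_id", "gene_symbol"],
)
```

Text in this form can be read by spreadsheets, statistical packages and most database import tools.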

In conjunction with our previous implementation of a GenBank format full parsing system (D'Addabbo et al., 2004, implemented under FileMaker Pro 6 environment and currently undergoing redesign in the entirely new ‘version 7’ structure), UniGene Tabulator may represent the nucleus for a novel relational multi-purpose and user-friendly modular platform for the analysis of biological data.

This work was funded by a MIUR ex 60% grant to P.S. Funding to pay the Open Access publication charges was provided by “Progetto Strategico d'Ateneo 2005” of University of Bologna to R.C.

Conflict of Interest: none declared.

REFERENCES

Boguski M.S. et al. dbEST—database for ‘expressed sequence tags’. Nat. Genet., 1993, pg. 332-333.

D'Addabbo et al. GeneRecords: a relational database for GenBank flat file parsing and data manipulation in personal computers. Bioinformatics, 2004, pg. 2883-2885.

Liu and Grigoriev. Fast parsers for Entrez Gene. Bioinformatics, 2005, pg. 3189-3190.

Oliver D.E. et al. Tools for loading MEDLINE into a local relational database. BMC Bioinformatics, 2004, pg. 146.

Stephens. Alternative access. Developer Netw. J., 2002.

Wheeler D.L. et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res., 2005, pg. D39-D45.

Yuan et al. Genome analysis with gene-indexing databases. Pharmacol. Ther., 2001, pg. 115-132.

Author notes

Associate Editor: Alfonso Valencia

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
