Abstract

Summary: UniGene Tabulator 1.0 provides a solution for full parsing of the UniGene flat file format; it implements a structured graphical representation of each data field present in UniGene following import into a common database management system usable on a personal computer. This database includes related tables for sequence, protein similarity, sequence-tagged site (STS) and transcript map interval (TXMAP) data, plus a summary table where each record represents a UniGene cluster.

UniGene Tabulator enables full local management of UniGene data, allowing parsing, querying, indexing, retrieving, exporting and analysis of UniGene data in relational database form, on computers running Macintosh (OS X 10.3.9 or later) or Windows (2000 with Service Pack 4, XP with Service Pack 2, or later) operating systems.

Availability: The current release, including the FileMaker runtime applications for both platforms, is freely available at Author Webpage

Contact:  [email protected]

Supplementary information: We also distribute a precalculated implementation for current Homo sapiens (build #190, March 2006) and Danio rerio (zebrafish, build #90, March 2006) UniGene data.

1 INTRODUCTION

UniGene is ‘an experimental system for automatically partitioning GenBank sequences into a non-redundant set of gene-oriented clusters’ (Wheeler et al., 2005). As a consequence, building UniGene has made it possible to assign a set of finished and expressed sequence tag (EST) (Boguski et al., 1993) RNA sequence entries to a unique genomic locus. This feature makes UniGene relevant in virtually every case where manipulation of gene data and sequences is necessary. In fact, UniGene data are commonly used in a wide range of databases and applications, including attribution of a spotted cDNA/oligonucleotide to a transcribed locus in microarray data analysis, integration of database and cross-referencing systems, gene function prediction (owing to the availability of precomputed protein similarities for each cluster), systematic analysis of tissue-specific expression and map position, and gene discovery (Yuan et al., 2001).

There are two alternative ways to access UniGene data: querying the ‘Entrez’ website with UniGene selected as the database (Author Webpage), or downloading UniGene flat files from its FTP repository site (Author Webpage). The first method returns only the information that is searchable via the website query form, and complex queries are possible only within the limits of the ‘Entrez’ query language; the retrieved information can then be downloaded. The FTP download site allows users to save to their own computer a flat file containing all cluster data for the desired species.

In both cases the user obtains a plain text file, which contains data in a standard text format that is nevertheless not intended to be structured in a relational way. Converting data available in a flat file format into the appropriate record fields of a relational database requires a method for parsing the information. A user with advanced information technology skills could use a programming or scripting language (Perl with BioPerl, C++, Java and so on) to this end. To date, an increasing number of papers have described software based on languages of this kind and designed for full parsing of existing biomedical data repositories, e.g. Entrez Gene (Liu and Grigoriev, 2005) and MEDLINE (Oliver et al., 2004). However, although links to the UniGene website or internal cross-references to UniGene data may be provided, no published implementation offers a complete parser that converts UniGene flat files into a fully relational database.

Here we describe the first ad hoc system for parsing, querying, retrieving, exporting and analysing UniGene data in relational database form, usable on computers running Macintosh (OS X 10.3.9 or later) or Windows (2000 with Service Pack 4, XP with Service Pack 2, or later) operating systems.

2 RESULTS

To parse UniGene data, we have developed the software UniGene Tabulator. In this software, the application UniGene Tabulator (a FileMaker runtime) runs the original database named UniGene.UGT. The FileMaker Pro engine (FileMaker, Santa Clara, CA; see Stephens, 2002, Author Webpage) is database management software that is commonly used as an office application and is very easy to use. Using its developer version, a database may be distributed as a free FileMaker runtime application for both Macintosh and Windows platforms. The runtime application allows the end user to perform a wide subset of FileMaker Pro operations without buying the full software. The database UniGene.UGT imports the data for each specific feature of UniGene records, including cDNA library information, into the appropriate data fields of a FileMaker Pro 8 relational database file.

UniGene Tabulator is freely available at the site Author Webpage. We also distribute a precalculated implementation for current Homo sapiens (build #190, March 2006) and Danio rerio (zebrafish, build #90, March 2006) UniGene data (Hs.190UGTabMac.sit or Hs.190UGTabWin.zip and Dr.90UGTabMac.sitx or Dr.90UGTabWin.zip).

Finally, our distribution also provides precalculated text-format tables for the H. sapiens and D. rerio UniGene data, and updates are scheduled.

2.1 Development of the software

First, we analyzed the detailed description of the UniGene flat file format (Author Webpage) in order to identify characters that consistently delimit each data type, and to convert the flat file format into a series of related tables, allowing the appropriate import for each data type.

Our strategy is based on directly importing the downloaded files into a specific database table, and on parsing the information through an automated process, i.e. a FileMaker Pro script. The table ‘SEQUENCE’ collects data from the ‘UniGene’ file (.data format) into the specific (hidden) layout ‘DATA’. The lines of UniGene files are delimited by a line feed (‘LF’) character, so each line is imported into a separate record. During the import step, each line is tagged, according to its starting characters, as one of the following: (1) sequence data line; (2) STS data line; (3) transcript map position data line; (4) protein similarity data line; (5) cluster general information data line. Sequence data are processed directly within the fields of this table, shown in the layout ‘SEQUENCE’, while the other information types are moved to their specific destination tables; during this process, appropriate scripts extract the information for each single UniGene feature (e.g. accession number of an STS, accession number of a similar protein, gene symbol, chromosome localization and so on). Data for each library are imported on request from ‘.lib.info’ files, joined into a single database record, parsed via text calculation fields and eventually displayed via a relationship within each record in the layout ‘SEQUENCE’, which at the end of the process represents all information available about each single nucleotide sequence listed in a UniGene cluster.
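The tagging step can be illustrated outside FileMaker. The following Python sketch is not the FileMaker Pro script distributed with UniGene Tabulator; it assumes that each line begins with an upper-case field tag and that the four specific tags coincide with the table names used in this paper (SEQUENCE, STS, TXMAP, PROTSIM), with every other line treated as cluster general information. The function names and the sample lines are hypothetical.

```python
from collections import defaultdict

# Assumption: the leading tag of a line equals the name of its destination table.
SPECIFIC_TAGS = ("SEQUENCE", "STS", "TXMAP", "PROTSIM")


def classify_line(line):
    """Tag a line by its starting characters; all other lines are cluster-level."""
    parts = line.split(None, 1)
    if parts and parts[0] in SPECIFIC_TAGS:
        return parts[0]
    return "CLUSTER"


def route_lines(text):
    """Split on line feeds and collect each non-empty line under its tag."""
    tables = defaultdict(list)
    for line in text.split("\n"):
        if line.strip():
            tables[classify_line(line)].append(line)
    return tables


# Hypothetical sample lines, for illustration only (not real UniGene records).
sample = "ID  Xx.1\nSTS  ACC=X1\nSEQUENCE  ACC=Y1; SEQTYPE=EST\nPROTSIM  ORG=1; PCT=90\n"
routed = route_lines(sample)
```

In this sketch the dictionary keys play the role of the destination tables, so each line ends up in exactly one of five groups.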

For each UniGene cluster, information about sequence-tagged sites (STS), gene map positions obtained from ‘Radiation hybrid’ experiments and the existence of known ortholog proteins is displayed, if available, in the ‘STS’ table, ‘TXMAP’ table and ‘PROTSIM’ table, respectively.
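Extracting a single feature from a tagged line, such as the accession number of an STS or of a similar protein, amounts to splitting the line into named attributes. The Python sketch below is only an illustration of that idea, not the text calculation used in the FileMaker scripts; it assumes that attributes are written as KEY=VALUE pairs separated by semicolons or spaces, and the attribute names and values in the example line are hypothetical.

```python
import re

# Assumption: attributes appear as KEY=VALUE pairs separated by ';' or whitespace.
PAIR = re.compile(r"(\w+)=([^;\s]+)")


def extract_attributes(line):
    """Return the leading tag of a line and a dict of its KEY=VALUE attributes."""
    parts = line.split(None, 1)
    tag = parts[0] if parts else ""
    rest = parts[1] if len(parts) > 1 else ""
    return tag, dict(PAIR.findall(rest))


# Hypothetical example line, for illustration only.
tag, attrs = extract_attributes("PROTSIM  ORG=1; PROTID=XP_0.1; PCT=90.5")
```

Each key of the resulting dictionary would correspond to one field of the destination table.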

Finally, all information common to each UniGene cluster, such as UniGene identifier, gene symbol or chromosomal location, is displayed in a single record in the summary table ‘UniGene’, so that each record corresponds to a single cluster.
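The relationship between the summary table and the feature tables can be sketched with any relational engine. The example below uses SQLite from the Python standard library purely as a stand-in for FileMaker Pro; the table names follow those given in the text, whereas the column names, the cluster identifier ‘Xx.1’ and the accession numbers are hypothetical and do not reproduce the actual field definitions of UniGene.UGT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One summary record per cluster; every feature table points back to it.
# Column names are illustrative assumptions, not the UniGene.UGT fields.
conn.executescript("""
CREATE TABLE unigene  (cluster_id TEXT PRIMARY KEY, gene_symbol TEXT, chromosome TEXT);
CREATE TABLE sequence (cluster_id TEXT REFERENCES unigene(cluster_id), accession TEXT);
CREATE TABLE sts      (cluster_id TEXT REFERENCES unigene(cluster_id), accession TEXT);
CREATE TABLE txmap    (cluster_id TEXT REFERENCES unigene(cluster_id), map_interval TEXT);
CREATE TABLE protsim  (cluster_id TEXT REFERENCES unigene(cluster_id), protein_accession TEXT);
""")

# Hypothetical data: one cluster containing two sequences.
conn.execute("INSERT INTO unigene VALUES ('Xx.1', 'GENE1', '1')")
conn.executemany(
    "INSERT INTO sequence VALUES (?, ?)",
    [("Xx.1", "ACC0001"), ("Xx.1", "ACC0002")],
)

# Count the sequences attached to each cluster through the relationship.
rows = conn.execute("""
    SELECT u.cluster_id, u.gene_symbol, COUNT(s.accession)
    FROM unigene AS u
    LEFT JOIN sequence AS s ON s.cluster_id = u.cluster_id
    GROUP BY u.cluster_id
""").fetchall()
```

A query of this kind returns one row per cluster, mirroring the one-record-per-cluster organisation of the summary table.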

A detailed user guide including technical specifications is distributed with the software, along with an introductory tutorial for users. The runtime application allows full data import and export in several formats as well as complete record management and browsing, while the creation of new fields for further calculation or relationship definition within the database itself requires the installation of the FileMaker Pro 8 application, or higher.

3 DISCUSSION

We have described the first software to execute a full parsing of UniGene data and generate a relational, fully indexed database, usable on computers running Macintosh and Windows operating systems. UniGene Tabulator cannot be used on computers running UNIX-like operating systems for which no appropriate FileMaker Pro implementation is available; however, it should run on any full Windows emulator. For example, we successfully used UniGene Tabulator for Windows on a Windows XP system running under the Virtual PC 7.0.1 emulator software on a Macintosh computer with OS X 10.3.9.

Following a different approach from ours, a National Human Genome Research Institute (NHGRI, Author Webpage) site has distributed a parsed version of human and mouse UniGene to be imported into an appropriate FileMaker Pro 4 database template (Author Webpage). However, this approach presents several limitations: first, the parser algorithm and software are not released; in addition, the database structure is not fully relational, in that it uses a ‘repeated field’ function which does not fulfill the criteria for true database relationality and imposes a limit of 1000 GenBank accession numbers per cluster; finally, parsed importable datasets were provided only for human and mouse UniGene data, and periodic release was discontinued following human UniGene build #155 (September 2002).

UniGene Tabulator lets users parse updated UniGene flat files, which contain increasing numbers of UniGene clusters and GenBank sequence accession numbers. Our implementation can export data, which users can easily search, in a general tab-delimited text format suitable for any cross-platform data management system and for any purpose.
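To illustrate why a tab-delimited text export can be read by virtually any data management system, the following Python sketch writes a few records in that format using only the standard library. It is not part of UniGene Tabulator; the function name, field names and record values are hypothetical.

```python
import csv
import io


def export_tab_delimited(records, fieldnames, handle):
    """Write records (dicts) as tab-delimited text with a header row."""
    writer = csv.DictWriter(
        handle, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)


# Hypothetical summary record, for illustration only.
records = [{"cluster_id": "Xx.1", "gene_symbol": "GENE1", "chromosome": "1"}]
buffer = io.StringIO()
export_tab_delimited(records, ["cluster_id", "gene_symbol", "chromosome"], buffer)
exported = buffer.getvalue()
```

The same function can write to a file opened in text mode instead of an in-memory buffer.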

In conjunction with our previous implementation of a GenBank format full parsing system (D'Addabbo et al., 2004, implemented under FileMaker Pro 6 environment and currently undergoing redesign in the entirely new ‘version 7’ structure), UniGene Tabulator may represent the nucleus for a novel relational multi-purpose and user-friendly modular platform for the analysis of biological data.

This work was funded by a MIUR ex 60% grant to P.S. Funding to pay the Open Access publication charges was provided by “Progetto Strategico d'Ateneo 2005” of University of Bologna to R.C.

Conflict of Interest: none declared.

REFERENCES

Boguski M.S., et al. dbEST—database for ‘expressed sequence tags’, Nat. Genet., 1993, vol. 4 (pg. 332-333)

D'Addabbo P., et al. GeneRecords: a relational database for GenBank flat file parsing and data manipulation in personal computers, Bioinformatics, 2004, vol. 20 (pg. 2883-2885)

Liu M., Grigoriev A. Fast parsers for Entrez Gene, Bioinformatics, 2005, vol. 21 (pg. 3189-3190)

Oliver D.E., et al. Tools for loading MEDLINE into a local relational database, BMC Bioinformatics, 2004, vol. 5 (pg. 146)

Stephens P. Alternative access, Developer Netw. J., 2002, vol. 28 (pg. 30-32)

Wheeler D.L., et al. Database resources of the National Center for Biotechnology Information, Nucleic Acids Res., 2005, vol. 33 (pg. D39-D45)

Yuan J., et al. Genome analysis with gene-indexing databases, Pharmacol. Ther., 2001, vol. 91 (pg. 115-132)

Author notes

Associate Editor: Alfonso Valencia

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.