Retrieve-ensembl-seq: user-friendly and large-scale retrieval of single or multi-genome sequences from Ensembl

Abstract

Summary: The preparation of an appropriate sequence dataset is the starting point of all genomic analyses. We present retrieve-ensembl-seq, an application that considerably eases the retrieval of sequences from the Ensembl database, via our user-friendly web site or web services. The user provides Ensembl identifiers or gene names, and the program returns the corresponding upstream, downstream, intronic, exonic, UTR or whole gene sequences. retrieve-ensembl-seq also offers a multiple organism mode to retrieve sequences from homologous genes at any taxonomical level. We also introduce various original filters, such as the masking of coding fragments and the avoidance of sequence redundancy for genes with multiple transcripts. retrieve-ensembl-seq is included in the software suite regulatory sequence analysis tools (RSAT), allowing instant submission of retrieved sequences to further analysis tools.

Availability: retrieve-ensembl-seq is integrated in the RSAT suite: http://rsat.ulb.ac.be/rsat. Web site: http://rsat.ulb.ac.be/rsat/retrieve-ensembl-seq_form.cgi. Web services: http://rsat.ulb.ac.be/rsat/web_services/RSATWS.wsdl. Stand-alone distribution: freely available under an academic licence to download from the RSAT web site. The complete manual, a convenient tutorial and demos are available from the RSAT website. Additional help can be found on the RSAT public forum.

Contact: [email protected]

1 TOOL FUNCTIONALITY

Any genomic study starts with the preparation of a sequence dataset to analyse. The Ensembl database (Hubbard et al., 2009) is a key provider of genomic data for vertebrates and other eukaryotes, with 50 species supported in version 53. Moreover, it is currently expanding to a wider taxonomic range with bacteria, fungi, plants and protists. To retrieve personalized datasets, Ensembl offers access via BioMart (Kasprzyk et al., 2004) and application programming interfaces (APIs). However, BioMart does not fully exploit the richness of annotations provided by Ensembl (e.g. introns) and offers very limited support for multi-species sequence retrieval (maximum five species per query). The use of APIs requires programming skills that not all interested users possess.

We present here retrieve-ensembl-seq, a user-friendly program that retrieves a large variety of sequence types (upstream, downstream, introns, exons, UTRs and whole genes) from the Ensembl database. At each query, retrieve-ensembl-seq establishes a direct connection to Ensembl via its Core and Compara APIs, thereby ensuring permanent access to up-to-date sequences. Original options are introduced here. These include the masking of coding fragments (converted to N characters) and the possibility to get only non-redundant sequences for genes with multiple transcripts. Furthermore, when retrieving upstream or downstream sequences, one can prevent overlap with a neighbouring gene or open reading frame (ORF).

Comparative genomics is of growing importance for the detection of conserved features across related genomes. Yet, fetching sequences remains a tedious task, performed almost organism by organism. retrieve-ensembl-seq proposes an original multiple organism mode to automatically retrieve homologous sequences at any taxonomical level. The integration of retrieve-ensembl-seq in the regulatory sequence analysis tools (RSAT; Thomas-Chollier et al., 2008; van Helden et al., 2000) ensures its seamless interconnection with pattern discovery and matching programs, plus convenient utilities. To suit all types of users, the program is accessible via a web site, SOAP web services and the command line. The next sections detail the parameters of retrieve-ensembl-seq.

2 SINGLE ORGANISM MODE

2.1 Organism

All Ensembl-supported organisms are available. The web site always queries the latest Ensembl version. Previous versions can be accessed via web services or command line.

2.2 Query (Gene, transcript or protein ID)

The query can be one or several Ensembl IDs or gene names. Identifiers from other databases are also accepted as long as they are valid external references in Ensembl. If a transcript or protein ID is provided, the program first recovers the corresponding gene ID before sequence retrieval. It is thus presently not possible to query for one given transcript or protein when there are alternative ones.

2.3 Sequence type

The sequence type parameter determines the kind of sequence to retrieve (Fig. 1A). It can be of type upstream/downstream, gene (whole gene sequence, from the 5′-most transcription start site (TSS) to the 3′-most transcription termination site (TTS)), intron (all intron sequences), first intron, exon (all exon sequences), non-coding exon (only the non-coding parts of exon sequences), UTRs (both UTR sequences), 5′ UTR or 3′ UTR.
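As an illustration of how the intron and first intron types relate to exon annotations, the following minimal sketch derives intron intervals from the exon coordinates of a single transcript. It is not the program's actual code: the function names, the 1-based inclusive coordinates and the strand encoding (1 or -1) are assumptions made for this example.

```python
def introns_from_exons(exons):
    """Return the intron intervals lying between consecutive exons.

    exons: list of (start, end) tuples for one transcript,
    in 1-based inclusive genomic coordinates (an assumption of this sketch).
    """
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start - prev_end > 1:  # abutting exons leave no intron
            introns.append((prev_end + 1, next_start - 1))
    return introns


def first_intron(exons, strand):
    """Return the 5'-most intron, or None for a single-exon transcript."""
    introns = introns_from_exons(exons)
    if not introns:
        return None
    # On the reverse strand the 5'-most intron has the highest coordinates.
    return introns[0] if strand == 1 else introns[-1]


exons = [(100, 200), (301, 400), (551, 700)]
print(introns_from_exons(exons))  # [(201, 300), (401, 550)]
print(first_intron(exons, -1))    # (401, 550)
```

The strand matters only for the first intron type: the set of introns is the same on both strands, but which one counts as first depends on the direction of transcription.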

Fig. 1.

Schematic representation of genomic regions that can be retrieved with the program retrieve-ensembl-seq. (A) Sequence types (shown in italics). Boxes represent exons, broken lines represent introns and straight lines represent intergenic regions. Black boxes represent coding exons, white boxes represent non-coding exons. (B) Coordinate system used for upstream and downstream sequence types. Negative coordinates refer to positions upstream of the reference feature, positive ones refer to positions downstream of the reference feature. (C) Treatment of redundancy for genes with alternative transcripts. The dotted oval highlights mutually overlapping fragments in the upstream sequences of transcripts 1 and 2. These will be merged in the retrieved sequences. The upstream sequence for transcript 3 will be fully retrieved.

With the upstream/downstream type (Fig. 1B), it is necessary to specify the feature taken as reference and the positions relative to this feature (‘From’ and ‘To’ values). The ‘Relative to feature’ option value can be CDS (coding sequence; start or stop codon, depending on whether an upstream or a downstream sequence is requested), mRNA (TSS for upstream sequences, TTS for downstream sequences) or gene (5′-most TSS for upstream sequences, 3′-most TTS for downstream sequences). ‘From’ and ‘To’ values can be negative or positive. A negative value refers to a coordinate upstream of the feature chosen as reference, whereas a positive value refers to a coordinate downstream of it. If a negative value is set for the ‘From’ parameter and a positive value is set for the ‘To’ parameter, the retrieved sequence will span the reference feature.
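The coordinate system described above can be sketched as a conversion from ‘From’ and ‘To’ offsets to genomic coordinates. This is an illustrative example only, not the program's implementation: the function name, the strand encoding and the convention that offset 0 designates the reference position itself are assumptions.

```python
def relative_to_genomic(ref_pos, strand, rel_from, rel_to):
    """Convert 'From'/'To' offsets into a genomic (start, end), start <= end.

    ref_pos: genomic coordinate of the reference position (e.g. a TSS).
    strand: 1 for the forward strand, -1 for the reverse strand.
    Offset 0 is assumed to be the reference position itself.
    """
    if rel_from > rel_to:
        raise ValueError("'From' must not exceed 'To'")
    if strand == 1:
        return ref_pos + rel_from, ref_pos + rel_to
    if strand == -1:
        # On the reverse strand, upstream means higher genomic coordinates.
        return ref_pos - rel_to, ref_pos - rel_from
    raise ValueError("strand must be 1 or -1")


print(relative_to_genomic(10_000, 1, -2000, -1))   # (8000, 9999)
print(relative_to_genomic(10_000, -1, -2000, -1))  # (10001, 12000)
print(relative_to_genomic(10_000, 1, -500, 100))   # (9500, 10100)
```

The last call uses a negative ‘From’ and a positive ‘To’, so the resulting interval spans the reference position.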

2.4 Additional options

With the option ‘Prevent overlap with neighbouring ORF or gene’, the sequence is truncated at the limit of the neighbouring ORF or gene.
Two masking options are available: ‘repeats’ (if annotated) and ‘coding sequences’.
It is possible to avoid redundant sequences due to alternative transcripts (Fig. 1C). When the query gene has more than one transcript, the retrieved sequences for this gene may have overlapping regions. This option removes, for each individual query, any redundancy in the retrieved sequence set, by merging mutually overlapping fragments.
The organism name in the sequence fasta header can be set to scientific, common or none (no organism name in the fasta header).
The type of output can be defined as display (results printed on result page), server (link to results printed on result page) or email (link to results sent by email).
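The redundancy and masking options listed above can be illustrated with a short sketch. It assumes fragments are given as 1-based inclusive (start, end) pairs on one chromosome; the function names and the example coordinates are hypothetical and do not reflect the program's internal code.

```python
def merge_fragments(fragments):
    """Merge mutually overlapping (start, end) fragments."""
    merged = []
    for start, end in sorted(fragments):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous fragment: extend it instead of adding one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def mask_intervals(sequence, seq_start, intervals):
    """Replace with 'N' the parts of `sequence` covered by `intervals`.

    seq_start is the genomic coordinate of the first base of `sequence`.
    """
    bases = list(sequence)
    for start, end in intervals:
        lo = max(start - seq_start, 0)
        hi = min(end - seq_start + 1, len(bases))
        for i in range(lo, hi):
            bases[i] = "N"
    return "".join(bases)


# Upstream regions of three alternative transcripts of one query gene.
upstream = [(8000, 9999), (8500, 10499), (20000, 21999)]
print(merge_fragments(upstream))  # [(8000, 10499), (20000, 21999)]
print(mask_intervals("ACGTACGTAC", 101, [(104, 106)]))  # ACGNNNGTAC
```

As in Fig. 1C, the first two fragments overlap and are merged, whereas the third is kept in full.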

3 MULTIPLE ORGANISM MODE

In multiple organism mode, retrieve-ensembl-seq automatically collects not only the sequences of the query gene in the query organism, but also the sequences of its homologs in all other organisms. More refined results can be obtained by filtering the homologs on their homology type and/or on their taxonomic level (taxon).

The homology type filter supports all Ensembl-defined homologies: all homologs (no filter), orthologs (all orthologs), ortholog_one2one, ortholog_one2many, ortholog_many2many, apparent_ortholog_one2one, paralogs (all paralogs), within_species_paralog, between_species_paralog.

The taxon filter restricts the retrieval of homologous sequences to a given taxonomic level (example: Mammalia).
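The combined effect of the two filters can be sketched as follows. This is a simplified illustration under stated assumptions: the Homolog record, the placeholder gene identifiers and the grouping of the individual homology types under ‘orthologs’ and ‘paralogs’ are hypothetical, and the program itself obtains homology information through the Ensembl Compara API rather than from such a list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Homolog:
    gene_id: str        # placeholder identifier, not a real Ensembl ID
    organism: str
    homology_type: str  # e.g. "ortholog_one2one"
    lineage: tuple      # taxa that contain the organism


# Assumed grouping of the homology types named in the text.
ORTHOLOG_TYPES = {"ortholog_one2one", "ortholog_one2many",
                  "ortholog_many2many", "apparent_ortholog_one2one"}
PARALOG_TYPES = {"within_species_paralog", "between_species_paralog"}


def filter_homologs(homologs, homology_type="all", taxon=None):
    """Keep homologs matching the homology type and, if given, the taxon."""
    if homology_type == "all":
        allowed = None
    elif homology_type == "orthologs":
        allowed = ORTHOLOG_TYPES
    elif homology_type == "paralogs":
        allowed = PARALOG_TYPES
    else:
        allowed = {homology_type}
    return [h for h in homologs
            if (allowed is None or h.homology_type in allowed)
            and (taxon is None or taxon in h.lineage)]


homologs = [
    Homolog("gene_mouse", "Mus musculus", "ortholog_one2one",
            ("Vertebrata", "Mammalia")),
    Homolog("gene_zebrafish", "Danio rerio", "ortholog_one2many",
            ("Vertebrata",)),
    Homolog("gene_human_2", "Homo sapiens", "within_species_paralog",
            ("Vertebrata", "Mammalia")),
]
selected = filter_homologs(homologs, "orthologs", taxon="Mammalia")
print([h.gene_id for h in selected])  # ['gene_mouse']
```

Both filters are optional and independent, so leaving them at their defaults returns every homolog.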

All options described for the single organism mode apply here as well.

4 EXAMPLES AND HELP

To help first-time users, two demos are available from the web site. DEMO_1 shows an example of the single organism mode and DEMO_2 an example of the multiple organism mode.

There is also a tutorial on the web site to guide beginners through the various options. Finally, there is a manual describing every parameter. For quick reference, each parameter of the web site form is linked to its description in the manual. A flowchart on the RSAT web site tutorials page shows typical analysis paths.

Funding: This work was supported by the BioSapiens Network of Excellence funded under the sixth Framework program of the European Communities (LSHG-CT-2003-503265) and the Vrije Universiteit Brussel (Geconcerteerde Onderzoeksactie 29). The BiGRe Laboratory is part of the Belgian Program on Interuniversity Attraction Poles, initiated by the Belgian Federal Science Policy Office, project P6/25 (BioMaGNet).

Conflict of Interest: none declared.

REFERENCES

Hubbard TJP, et al. Ensembl 2009. Nucleic Acids Res. 2009, pp. D690-D697.

Kasprzyk, et al. EnsMart: a generic system for fast and flexible access to biological data. Genome Res. 2004, pp. 160-169.

Thomas-Chollier, et al. RSAT: regulatory sequence analysis tools. Nucleic Acids Res. 2008, pp. W119-W127.

Van Helden, et al. A web site for the computational analysis of yeast regulatory sequences. Yeast 2000, pp. 177-187.

Author notes

Associate Editor: Dmitrij Frishman
