A report on the 2009 SIG on short read sequencing and algorithms (Short-SIG)

High-throughput sequencing (HTS) technologies are revolutionizing the way biologists acquire and analyze genomic data. HTS instruments, such as the Illumina Genome Analyzer and the Applied Biosystems SOLiD System, can currently sequence tens of gigabases per week at a cost 200-fold lower than that of previous methods, potentially enabling the routine sequencing of human and other genomes. Over the last few years the promise of HTS technologies has become a reality; however, realizing the full promise of these technologies requires the development of computational methods that can analyze the resulting datasets to infer biological meaning. HTS can be used to study many biological problems, including assembling genomes of new organisms, identifying genome variation within a population, discovering novel transcripts, analyzing gene expression, discerning the regulatory mechanisms behind expression levels and profiling the metagenome of a community. While many HTS datasets are readily available, the main bottleneck in the analysis is the dearth of computational methods that can directly answer biologists' questions from these datasets.

The Special Interest Group on Short Read Sequencing and Algorithms (Short-SIG), held in conjunction with the Intelligent Systems for Molecular Biology (ISMB) conference, is a meeting that brings together computational biologists interested in analyzing these HTS datasets. The first Short-SIG, held in Toronto in 2008, brought together over 120 attendees and featured 18 podium presentations, many of them addressing the computational problems of read mapping (the alignment of reads to a larger reference genome) and assembly (the de novo generation of the genome of an organism from short read data). During the year that followed, significant progress was made in these fields, and the 2009 meeting, held in Stockholm on June 28, concentrated on the development of methods that can analyze the resulting read mappings and assemblies to infer biological meaning. The meeting brought together over 200 researchers and featured 17 platform presentations selected from 27 abstracts and 17 full paper submissions. The keynote address at the SIG was delivered by Dr Edwin Cuppen of the Hubrecht Laboratory (Utrecht, The Netherlands). The paper submissions were handled in coordination with the Bioinformatics journal, and a physical copy of the Bioinformatics ‘virtual issue’ on HTS, featuring papers published on this topic in Bioinformatics over the past year, was presented to all meeting attendees. Bioinformatics also sponsored a best paper award for the conference, given to Kai Ye and his co-authors for the paper ‘Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads’, as well as an award for a paper chosen from the ‘virtual issue’, which was given to Cole Trapnell and colleagues for ‘TopHat: discovering splice junctions with RNA-Seq’.

1 VARIATION DISCOVERY

One of the most prominent applications of HTS is the resequencing of human genomes. Genomic variants are discovered by mapping reads from a donor genome to a reference human genome (typically the NCBI assembly), and the resulting mappings are then analyzed to identify differences between the donor and the reference. The SIG saw eight presentations on variation identification, covering variants of all sizes, from SNPs to larger structural variants. From the SNP discovery perspective, Adrian Dalca presented VARiD, a generalized framework for calling SNPs from both regular (letter-space) reads and di-base encoded (color-space) data. Sohrab Shah presented SNVmix, a Bayesian mixture model-based method for discovering single nucleotide variants from somatic tissues, where the observed alleles and their ratios may vary due to adjacent tissues being present in a biopsy. There were also presentations on variation discovery methods from the two leading HTS technology manufacturers, Illumina and Life Technologies (Applied Biosystems). Dirk Evers of Illumina spoke of recent improvements to the CASAVA framework that allow for more accurate discovery of variants from long reads, especially indels and copy number variants. He presented the results of CASAVA on recently sequenced paired tumor/normal genomes from a melanoma cell line. Fiona C. L. Hyland, from Life Technologies, described diBayes, a Bayesian framework for SNP discovery from color-space data, and presented an extensive analysis of a SOLiD human dataset, including discovered SNPs, small indels and larger structural variants.
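The idea behind Bayesian variant calling from read counts can be illustrated with a minimal sketch. It is not the SNVmix, VARiD or diBayes implementation: it assumes three genotype classes, a binomial likelihood for the number of non-reference reads, and fixed, hypothetical values for the expected non-reference fraction (`alt_fraction`) and the genotype priors (`prior`). A mixture-model method would instead estimate such parameters from the data, which is what lets it adapt to allele ratios skewed by admixed tissue.

```python
import math

GENOTYPES = ("hom_ref", "het", "hom_alt")

def genotype_posteriors(ref_count, alt_count,
                        alt_fraction=(0.01, 0.5, 0.99),
                        prior=(0.98, 0.01, 0.01)):
    """Posterior probability of each genotype at one site.

    alt_fraction and prior are illustrative assumptions, not values
    taken from any published caller.
    """
    n = ref_count + alt_count
    weights = []
    for mu, pi in zip(alt_fraction, prior):
        # Binomial likelihood of seeing alt_count non-reference reads out of n
        likelihood = math.comb(n, alt_count) * mu**alt_count * (1 - mu)**ref_count
        weights.append(pi * likelihood)
    total = sum(weights)
    return tuple(w / total for w in weights)

def call_genotype(ref_count, alt_count, **kwargs):
    """Return the most probable genotype label and its posterior."""
    post = genotype_posteriors(ref_count, alt_count, **kwargs)
    best = max(range(len(post)), key=post.__getitem__)
    return GENOTYPES[best], post[best]
```

Lowering the middle entry of `alt_fraction` (for example to 0.3) mimics a heterozygous somatic variant diluted by normal tissue in the biopsy.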

The advent of high-throughput sequencing has for the first time allowed large, cost-effective studies to detect larger, structural variants. Such variation has been associated with numerous diseases, including autism, schizophrenia and cancer, making its discovery an important challenge for computational biologists. Several talks at the SIG focused on the development of novel methods for discovery of such variants. Kai Ye presented a method called Pindel, which, by anchoring the mates of nonmapping reads to a genomic location, was able to use split-mapping to detect deletion events as large as 10 kb with base-level precision (Ye et al., 2009). This paper was the winner of the best SIG paper award, sponsored by Bioinformatics. Seunghak Lee and Weldon Whitener showed how to detect smaller indels from paired-end data by using the distribution of insert sizes of all mate pairs that span each genomic location. Paul Medvedev described a way that the depth-of-coverage signal can be combined with paired-end mapping-based techniques to detect copy number variants within segmental duplications. Overall, this year has seen the detection of structural variation come to the forefront of algorithmic research, and the next year will hopefully bring about more fully developed biologist-friendly tools.
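The insert-size idea mentioned above can be sketched in a few lines. This is only an illustration under simplifying assumptions, not the method presented at the SIG: it assumes the library insert size is roughly normal with known mean `lib_mean` and standard deviation `lib_sd`, and it tests whether the mean mapped distance of the mate pairs spanning one locus is shifted. The function names and the cutoff `z_cutoff` are hypothetical.

```python
import math
from statistics import mean

def indel_signal(spanning_inserts, lib_mean, lib_sd):
    """Return (shift, z) for the mate pairs spanning one genomic location.

    shift is the mean mapped distance minus the library mean; a positive
    shift suggests a deletion in the donor, a negative one an insertion.
    z is the shift in units of the standard error of the mean.
    """
    n = len(spanning_inserts)
    if n == 0:
        raise ValueError("no spanning mate pairs at this location")
    shift = mean(spanning_inserts) - lib_mean
    z = shift / (lib_sd / math.sqrt(n))
    return shift, z

def classify(z, z_cutoff=3.0):
    """Turn the z-score into a call; the cutoff is an arbitrary assumption."""
    if z > z_cutoff:
        return "deletion"
    if z < -z_cutoff:
        return "insertion"
    return "no call"

# Usage: nine mate pairs mapped about 30 bp further apart than expected
shift, z = indel_signal([230, 228, 232, 231, 229, 230, 233, 227, 230],
                        lib_mean=200, lib_sd=10)
print(shift, classify(z))
```

Because the standard error shrinks with the number of spanning pairs, even an indel much smaller than the library standard deviation can become detectable at sufficient coverage, which is why pooling all mate pairs at a location is useful.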

2 RNA SEQUENCING

Another exciting application of HTS technologies is RNA sequencing. RNA sequencing is currently used for several applications, including RNA expression, de novo transcriptome sequencing for nonmodel organisms and novel transcript discovery; however, computational methods for the analysis of these data are in their infancy. For RNA and microRNA expression profiling, HTS has significant advantages compared with microarray methods in that it is better able to quantify both very common and very rare transcripts. Short-SIG featured five talks addressing various computational problems in RNA sequencing. Cole Trapnell presented his work on TopHat (Trapnell et al., 2009), a tool to map reads from RNA sequencing to a reference genome, while allowing for split-reads where the two ends of a read map in different locations (due to exon splicing). While the original paper was published as part of the ‘virtual issue’, the presentation included new improvements to the tool. Jan Prins presented MapSplice, an RNA mapping tool that is similar to TopHat, but includes the ability to consider noncanonical splice sites. Inanc Birol presented a version of the ABySS assembler for de novo mRNA assembly. ABySS was the first tool to attempt de novo assembly of the human genome, and the presentation reported the first results on de novo assembly of human transcriptome data. Finally, three presentations demonstrated methods to mine RNA-seq data for specific biomedical phenomena: Regina Bohnert presented an algorithm for identifying alternative transcripts and their expression levels, Chol-Hee Jung presented an analysis combining multiple Drosophila RNA-seq datasets in order to discover novel noncoding RNAs, and Gerald Quon showed that, using the ISOLATE framework (Quon and Morris, 2009), mRNA expression levels can be used to identify the tissue of origin in metastasized tumors.
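The notion of a split read can be made concrete with a toy example. The sketch below is not how TopHat or MapSplice work internally (real spliced aligners use indexed, mismatch-tolerant search); it is a brute-force, exact-match illustration in which a read is cut at every possible position and the two pieces are searched for in the genome with a gap between them. The function name, the `min_anchor` parameter and the `canonical_only` flag, which restricts introns to the common GT...AG boundary, are assumptions made for illustration.

```python
def find_spliced_alignment(genome, read, min_anchor=4, canonical_only=True):
    """Return (read_start, intron_start, intron_end) or None.

    Coordinates are 0-based; the intron is genome[intron_start:intron_end].
    Exact matching only, so this is a teaching sketch, not an aligner.
    """
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        start = genome.find(left)
        while start != -1:
            intron_start = start + split
            # The right piece must map downstream, leaving a nonempty gap
            pos = genome.find(right, intron_start + 1)
            while pos != -1:
                intron = genome[intron_start:pos]
                canonical = (len(intron) >= 4 and intron.startswith("GT")
                             and intron.endswith("AG"))
                if canonical or not canonical_only:
                    return start, intron_start, pos
                pos = genome.find(right, pos + 1)
            start = genome.find(left, start + 1)
    return None

# Usage: a read covering the last 6 bases of one exon and the first 6 of the next
exon1, intron, exon2 = "ACGTACCTGA", "GTAAGTTTTTCAG", "TTCGGATCCA"
genome = "CC" + exon1 + intron + exon2 + "GG"
read = exon1[-6:] + exon2[:6]
print(find_spliced_alignment(genome, read))
```

Setting `canonical_only=False` accepts any gap between the two pieces, which loosely mirrors the distinction drawn above between tools that require canonical splice sites and those that also consider noncanonical ones.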

3 METAGENOMICS, ASSEMBLY AND STATISTICS

The final session of the SIG was devoted to a variety of classical and emerging HTS applications. Bas Dutilh presented a method for mapping metagenomic reads to a reference genome, where the reference is changed during the mapping process to more accurately represent the community consensus genome, thus allowing a larger fraction of reads to map (Dutilh et al., 2009). Juliane Klein presented LOCAS, an assembler for short read data that is targeted toward low-coverage datasets, and significantly outperforms previous methods in this context. The last two presentations addressed the statistical issues underlying HTS. Su Yeon Kim described the statistical foundations for designing association studies with HTS, specifically the use of a combination of pooled and unpooled samples from a number of individuals. Adam Kowalczyk showed that it is possible to develop univariate statistical tests that compute the likelihood that two distributions of short read datasets are identical (P-values), based on the Poisson approximation to the binomial distribution.
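The Poisson approximation referred to above can be checked numerically. The sketch below does not reproduce the tests presented at the SIG; it only assumes a simple model in which each of `n` reads falls in a given window independently with small probability `p`, so that the window count is Binomial(n, p) and approximately Poisson with rate n·p. The function names are hypothetical.

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 minus the lower tail."""
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)  # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - cdf)

def binomial_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), using log-gamma to avoid overflow."""
    def log_pmf(i):
        return (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                + i * math.log(p) + (n - i) * math.log1p(-p))
    cdf = sum(math.exp(log_pmf(i)) for i in range(k))
    return max(0.0, 1.0 - cdf)

# Usage: one million reads, each landing in the window with probability 1e-5,
# so about 10 reads are expected; how surprising are 20 or more?
n, p, k = 1_000_000, 1e-5, 20
print(binomial_sf(k, n, p), poisson_sf(k, n * p))
```

With many reads and a small per-read probability the two tail probabilities agree closely, and the Poisson form needs only the single rate parameter n·p.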

4 KEYNOTE

The SIG ended with a keynote address by Dr Edwin Cuppen of the Hubrecht Laboratory, in Utrecht, The Netherlands. His presentation demonstrated both some interesting advantages of short read sequencing, such as the ability of ChIP-seq experiments to identify which genes are regulated by specific distal enhancers, and some key limitations, for example, that RNA-seq, while capable of profiling relative transcript levels in different conditions, is unable to reconstruct actual transcript levels due to biases introduced during sample preparation.

In addition to the podium presentations, many Short-SIG attendees used the opportunity to discuss collaborations and the general direction of the field. Clearly, the increase in read length (only 25–35 bp 2 years ago and 50–100 bp today) is making it difficult to develop timely tools, as the problems associated with reads of different lengths are quite dissimilar. Illumina and SOLiD reads will very soon be as long as 454 reads were a few years ago, and this dynamic is forcing bioinformaticians to rethink algorithms developed only a year ago. Similarly, the increasing throughput of the sequencing platforms requires algorithms to scale to larger datasets. Bioinformatics remains one of the key bottlenecks in HTS data analysis, with datasets created at a faster rate than they can be effectively analyzed, and few tools providing ‘one stop shopping’ for the complete analysis of a single dataset. Addressing these shortcomings is a key step to realizing the full promise of HTS technologies.

Conflict of Interest: none declared.

REFERENCES

Dutilh et al. (2009) Increasing the coverage of a metapopulation consensus genome by iterative read mapping and assembly. Bioinformatics.

Quon and Morris (2009) ISOLATE: a computational strategy for identifying the primary origin of cancers using high throughput sequencing. Bioinformatics.

Trapnell et al. (2009) TopHat: discovering splice junctions with RNA-Seq. Bioinformatics, pp. 1105–1111.

Ye et al. (2009) Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics.
