High-throughput sequencing (HTS) technologies are revolutionizing the way biologists acquire and analyze genomic data. HTS instruments, such as the Illumina Genome Analyzer and the Applied Biosystems SOLiD System, are currently able to sequence tens of gigabases per week at a cost 200-fold lower than that of previous methods, potentially enabling the routine sequencing of human and other genomes. Over the last few years the promise of HTS technologies has become a reality; however, realizing the full promise of these technologies requires the development of computational methods that can analyze the resulting datasets to infer biological meaning. HTS can be used to study many biological problems, including assembling the genomes of new organisms, identifying genome variation within a population, discovering novel transcripts, analyzing gene expression, discerning the regulatory mechanisms behind expression levels and profiling the metagenome of a community. While many HTS datasets are readily available, the main bottleneck in the analysis is the dearth of computational methods that can directly answer biologists' questions from these datasets.

The Special Interest Group on Short Read Sequencing and Algorithms (Short-SIG), held in conjunction with the Intelligent Systems for Molecular Biology (ISMB) conference, is a meeting that brings together computational biologists interested in analyzing these HTS datasets. The first Short-SIG, held in Toronto in 2008, brought together over 120 attendees and featured 18 podium presentations, many of them addressing the computational problems of read mapping (the alignment of reads to a larger reference genome) and assembly (the de novo generation of the genome of an organism from short read data). During the year that followed, significant progress was made in these fields, and the 2009 meeting, held in Stockholm on June 28, concentrated on the development of methods that can analyze the resulting read mappings and assemblies to infer biological meaning. The meeting brought together over 200 researchers and featured 17 platform presentations selected from 27 abstracts and 17 full paper submissions. The keynote address at the SIG was delivered by Dr Edwin Cuppen of the Hubrecht Laboratory (Utrecht, The Netherlands). The paper submissions were handled in coordination with the Bioinformatics journal, and a physical copy of the Bioinformatics ‘virtual issue’ on HTS, featuring papers published on this topic in Bioinformatics over the past year, was presented to all meeting attendees. Bioinformatics also sponsored a best paper award for the conference, given to Kai Ye and his co-authors for the paper ‘Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads’, as well as an award for a paper chosen from the ‘virtual issue’, which was given to Cole Trapnell and colleagues for ‘TopHat: discovering splice junctions with RNA-Seq’.

1 VARIATION DISCOVERY

One of the most prominent applications of HTS is the resequencing of human genomes. Genomic variants are discovered by mapping reads from a donor genome to a reference human genome (typically the NCBI assembly), and the resulting mappings are then analyzed to identify differences between the donor and the reference. The SIG saw eight presentations on variation identification, covering variants of all sizes, from SNPs to larger, structural variants. From the SNP discovery perspective, Adrian Dalca presented VARiD, a generalized framework for calling SNPs from both regular (letter-space) reads and di-base encoded (color-space) data. Sohrab Shah presented SNVmix, a Bayesian mixture model-based method for discovering single nucleotide variants from somatic tissues, where the observed alleles and their ratios may vary due to adjacent tissues being present in a biopsy. There were also presentations on variation discovery methods from the two leading HTS technology manufacturers, Illumina and Life Technologies (Applied Biosystems). Dirk Evers of Illumina spoke of recent improvements to the CASAVA framework that allow for more accurate discovery of variants from long reads, especially indels and copy number variants. He presented the results of CASAVA on recently sequenced paired tumor/normal genomes from a melanoma cell line. Fiona C. L. Hyland, from Life Technologies, described diBayes, a Bayesian framework for SNP discovery from color-space data, and presented an extensive analysis of a SOLiD human dataset, including discovered SNPs, small indels and larger structural variants.
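
The mixture-model idea behind this kind of SNV calling can be illustrated with a minimal sketch. It is not the SNVmix implementation: the three genotype components, their expected reference-allele fractions and the prior weights below are fixed, hypothetical values chosen for illustration, whereas a real tool would typically estimate such parameters from the data.

```python
import math

def genotype_posteriors(ref_count, depth,
                        ref_fraction=(0.99, 0.5, 0.01),  # assumed, not fitted
                        prior=(0.98, 0.01, 0.01)):       # assumed, not fitted
    """Posterior over three genotypes at one site.

    Each genotype is a binomial component: ref_count reference-allele
    reads out of depth reads, with success probability ref_fraction.
    """
    names = ("hom_ref", "het", "hom_alt")
    weights = []
    for mu, pi in zip(ref_fraction, prior):
        likelihood = (math.comb(depth, ref_count)
                      * mu ** ref_count
                      * (1.0 - mu) ** (depth - ref_count))
        weights.append(pi * likelihood)
    total = sum(weights)
    # Normalize prior-weighted likelihoods into posterior probabilities.
    return {name: w / total for name, w in zip(names, weights)}

# Hypothetical site: 10 of 20 reads carry the reference allele.
posterior = genotype_posteriors(ref_count=10, depth=20)
best_call = max(posterior, key=posterior.get)
```

Lowering the heterozygous component's expected fraction away from 0.5 is one simple way to mimic the situation described above, in which admixed adjacent tissue shifts the observed allele ratios.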

The advent of high-throughput sequencing has for the first time allowed large, cost-effective studies to detect larger, structural variants. Such variation has been associated with numerous diseases, including autism, schizophrenia and cancer, making its discovery an important challenge for computational biologists. Several talks at the SIG focused on the development of novel methods for the discovery of such variants. Kai Ye presented a method called Pindel, which, by anchoring nonmapping reads at the genomic location of their mapped mates, was able to use split-mapping to detect deletion events as large as 10 kb with base-level precision (Ye et al., 2009). This paper was the winner of the best SIG paper award, sponsored by Bioinformatics. Seunghak Lee and Weldon Whitener showed how to detect smaller indels from paired-end data by using the distribution of insert sizes of all mate pairs that span each genomic location. Paul Medvedev described a way in which the depth-of-coverage signal can be combined with paired-end mapping-based techniques to detect copy number variants within segmental duplications. Overall, this year has seen the detection of structural variation come to the forefront of algorithmic research, and the next year will hopefully bring about more fully developed, biologist-friendly tools.
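
The insert-size idea mentioned above can be sketched as a simple z-test. This is an illustrative simplification, not the method presented at the SIG: it assumes the library's insert sizes are approximately normal with known mean and standard deviation, and the function name and cutoff are hypothetical. Mate pairs that map farther apart than expected suggest a deletion in the donor, and pairs that map closer together suggest an insertion.

```python
import math
from statistics import mean

def indel_signal(spanning_inserts, lib_mean, lib_sd, z_cutoff=3.0):
    """Test whether mate pairs spanning a locus deviate from the library.

    spanning_inserts: mapped insert sizes (bp) of pairs spanning the locus.
    Returns (call, estimated_size_shift_bp, z_score).
    """
    n = len(spanning_inserts)
    if n == 0:
        return ("no_coverage", 0.0, 0.0)
    # Shift of the local mean insert size relative to the library mean.
    shift = mean(spanning_inserts) - lib_mean
    # Standard error of the mean shrinks with the number of spanning pairs,
    # which is why small indels become detectable at high coverage.
    z = shift / (lib_sd / math.sqrt(n))
    if z > z_cutoff:
        call = "deletion"
    elif z < -z_cutoff:
        call = "insertion"
    else:
        call = "none"
    return (call, shift, z)

# Hypothetical locus: nine spanning pairs, library mean 200 bp, SD 20 bp.
observed = [230, 228, 232, 229, 231, 230, 233, 227, 230]
call, shift, z = indel_signal(observed, lib_mean=200, lib_sd=20)
```

A 30 bp shift would be invisible in any single pair with a 20 bp standard deviation, but it becomes significant once several spanning pairs are pooled.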

2 RNA SEQUENCING

Another exciting application of HTS technologies is RNA sequencing. RNA sequencing is currently used for several applications, including RNA expression profiling, de novo transcriptome sequencing for nonmodel organisms and novel transcript discovery; however, computational methods for the analysis of these data are in their infancy. For RNA and microRNA expression profiling, HTS has significant advantages over microarray methods in that it is better able to quantify very common and very rare transcripts. Short-SIG featured five talks addressing various computational problems in RNA sequencing. Cole Trapnell presented his work on TopHat (Trapnell et al., 2009), a tool to map reads from RNA sequencing to a reference genome while allowing for split reads, where the two ends of a read map to different locations (due to exon splicing). While the original paper was published as part of the ‘virtual issue’, the presentation included new improvements to the tool. Jan Prins presented MapSplice, an RNA mapping tool that is similar to TopHat but includes the ability to consider noncanonical splice sites. Inanc Birol presented a version of the ABySS assembler for de novo mRNA assembly. ABySS was the first tool to attempt de novo assembly of the human genome, and the presentation included the first results on de novo assembly of human transcriptome data. Finally, three presentations demonstrated methods to mine RNA-seq data for specific biomedical phenomena: Regina Bohnert presented an algorithm for identifying alternative transcripts and their expression levels, Chol-Hee Jung presented an analysis combining multiple Drosophila RNA-seq datasets in order to discover novel noncoding RNAs, and Gerald Quon showed that, using the ISOLATE framework (Quon and Morris, 2009), mRNA expression levels can be used to identify the tissue of origin of metastasized tumors.

3 METAGENOMICS, ASSEMBLY AND STATISTICS

The final session of the SIG was devoted to a variety of classical and emerging HTS applications. Bas Dutilh presented a method for mapping metagenomic reads to a reference genome, where the reference is changed during the mapping process to more accurately represent the community consensus genome, thus allowing a larger fraction of reads to map (Dutilh et al., 2009). Juliane Klein presented LOCAS, an assembler for short read data that is targeted toward low-coverage datasets and significantly outperforms previous methods in this context. The last two presentations addressed the statistical issues underlying HTS. Su Yeon Kim described the statistical foundations of designing association studies with HTS, specifically the use of a combination of pooled and unpooled samples from a number of individuals. Adam Kowalczyk showed that it is possible to develop univariate statistical tests that compute the likelihood that two distributions of short read datasets are identical (P-values), based on the Poisson approximation to the binomial distribution.
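
The Poisson approximation mentioned above can be illustrated with a minimal sketch. It is not the test presented at the SIG: the function names and the read counts are hypothetical. The underlying fact is standard: when n reads each fall in a region with small probability p, the region's read count is Binomial(n, p), which is well approximated by Poisson(np).

```python
import math

def poisson_sf(k, lam):
    """Upper tail P(X >= k) for X ~ Poisson(lam)."""
    term = math.exp(-lam)  # pmf at 0
    cdf = 0.0
    for i in range(k):     # accumulate pmf for 0 .. k-1
        cdf += term
        term *= lam / (i + 1)
    return max(0.0, 1.0 - cdf)

def binomial_sf(k, n, p):
    """Exact upper tail P(X >= k) for X ~ Binomial(n, p)."""
    cdf = sum(math.comb(n, i) * p ** i * (1.0 - p) ** (n - i)
              for i in range(k))
    return max(0.0, 1.0 - cdf)

def region_pvalue(k_test, n_test, k_ref, n_ref):
    """P-value for seeing at least k_test reads in a region of the test
    sample, if the region's read rate equals that of the reference sample.
    The reference rate is treated as known, which is a simplification."""
    lam = n_test * k_ref / n_ref
    return poisson_sf(k_test, lam)

# Hypothetical counts: 6 of 1000 test reads versus 2000 of 1,000,000
# reference reads in the same region (expected count 2 under the null).
p_poisson = region_pvalue(6, 1000, 2000, 1_000_000)
p_exact = binomial_sf(6, 1000, 0.002)
```

Comparing `p_poisson` with `p_exact` shows how close the approximation is at this scale; the Poisson form has the practical advantage of remaining cheap to evaluate when n is in the millions.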

4 KEYNOTE

The SIG ended with a keynote address by Dr Edwin Cuppen of the Hubrecht Laboratory, in Utrecht, The Netherlands. His presentation demonstrated both some interesting advantages of short read sequencing, such as the ability of ChIP-seq experiments to identify which genes are regulated by specific distal enhancers, and some key limitations, for example, that RNA-seq, while capable of profiling relative transcript levels in different conditions, is unable to reconstruct actual transcript levels due to biases introduced during sample preparation.

In addition to the podium presentations, many Short-SIG attendees used the opportunity to discuss collaborations and the general direction of the field. Clearly, the increase in read length (only 25–35 bp 2 years ago and 50–100 bp today) is making it difficult to develop timely tools, as the problems associated with reads of different lengths are quite dissimilar. Illumina and SOLiD reads will very soon be as long as 454 reads were a few years ago, and this dynamic is forcing bioinformaticians to rethink algorithms developed only a year ago. Similarly, the increasing throughput of the sequencing platforms requires algorithms to scale to larger datasets. Bioinformatics remains one of the key bottlenecks in HTS data analysis, with datasets created at a faster rate than they can be effectively analyzed, and few tools providing ‘one stop shopping’ for the complete analysis of a single dataset. Addressing these shortcomings is a key step toward realizing the full promise of HTS technologies.

Conflict of Interest: none declared.

REFERENCES

Dutilh BE, et al. Increasing the coverage of a metapopulation consensus genome by iterative read mapping and assembly. Bioinformatics, 2009.
Quon G, Morris Q. ISOLATE: a computational strategy for identifying the primary origin of cancers using high throughput sequencing. Bioinformatics, 2009.
Trapnell C, et al. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics, 2009, vol. 25 (pg. 1105-1111).
Ye K, et al. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics, 2009.