SAMStat: monitoring biases in next generation sequencing data

Abstract

Motivation: The sequence alignment/map format (SAM) is a commonly used format to store the alignments between millions of short reads and a reference genome. Often certain positions within the reads are inherently more likely to contain errors due to the protocols used to prepare the samples. Such biases can have adverse effects on both mapping rate and accuracy. To understand the relationship between potential protocol biases and poor mapping, we wrote SAMStat, a simple C program that plots nucleotide overrepresentation and other statistics of mapped and unmapped reads in a concise html page. Collecting such statistics also makes it easy to highlight problems in the data processing and enables non-experts to track data quality over time.

Results: We demonstrate that studying sequence features in mapped data can be used to identify biases particular to one sequencing protocol. Once identified, such biases can be considered in the downstream analysis or even be removed by read trimming or filtering techniques.

Availability: SAMStat is open source and freely available as a C program running on all Unix-compatible platforms. The source code is available from http://samstat.sourceforge.net.

Contact: [email protected]

1 INTRODUCTION

Next generation sequencing is being applied to understand individual variation, the RNA output of a cell and epigenetic regulation. Not surprisingly, the mapping of short reads to the genome has received a lot of attention, with over 30 programs published to date [see Trapnell and Salzberg (2009) for a review of the most commonly used approaches]. Nevertheless, a noticeable fraction of reads commonly remains unmapped to the reference genome in each experiment. One possibility is that these reads simply contain more sequencing errors, in the form of mismatches, insertions or deletions, than the programs can handle. Alternatively, it is conceivable that these reads contain contaminants and therefore do not map to the expected reference sequence. Finally, the unmapped reads may represent novel splice junctions or genomic regions absent from the reference assembly. Understanding why reads remain unmapped is clearly of interest.

Mapping programs like MAQ (Li et al., 2008) and BWA (Li and Durbin, 2009) report mapping qualities, allowing for further investigation. We wrote SAMStat to contrast properties of unmapped, poorly mapped and accurately mapped reads to understand whether particular properties of the reads influence the mapping accuracy. As the name suggests, our program is designed to work mainly with SAM/BAM files (Li et al., 2009), but it can also be used to visualize nucleotide composition and other basic statistics of fasta and fastq (Cock et al., 2009) files.

2 METHODS

SAMStat automatically recognizes the input files as either fasta, fastq, SAM or BAM and reports several basic properties of the sequences as listed in Table 1. Multiple input files can be given for batch processing. For each dataset, the output consists of a single html5 page containing several plots allowing non-specialists to visually inspect the results. Naturally, the html5 pages can be viewed both on- and off-line and easily be stored for future reference. All properties are plotted separately for different mapping quality intervals if those are present in the input file. For example, mismatch profiles are given for high- and low-quality alignments, allowing users to verify whether poorly mapped reads contain a specific collection of mismatches. The latter may represent untrimmed linkers in a subset of reads. Dinucleotide overrepresentation is calculated as described by Frith et al. (2008). Overrepresented 10mers are identified by comparing the frequency of each 10mer within a mapping quality interval with the overall frequency of that 10mer.
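The 10mer comparison described above can be illustrated with a short Python sketch. It is not SAMStat's implementation (SAMStat is written in C, and the text does not give the exact statistic); it assumes overrepresentation is expressed as the simple ratio of a k-mer's relative frequency within one mapping quality interval to its relative frequency over all reads. The function names `kmer_counts` and `overrepresentation` and the interval labels are hypothetical.

```python
from collections import Counter


def kmer_counts(reads, k=10):
    """Count every overlapping k-mer in an iterable of read sequences."""
    counts = Counter()
    for seq in reads:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def overrepresentation(reads_by_interval, k=10):
    """Return {interval: {kmer: ratio}}.

    ratio = (k-mer frequency inside the interval) / (k-mer frequency overall).
    Assumption: a plain frequency ratio; the source does not state the formula.
    """
    per_interval = {name: kmer_counts(reads, k)
                    for name, reads in reads_by_interval.items()}

    # Pool the counts of all intervals to obtain the overall frequencies.
    overall = Counter()
    for counts in per_interval.values():
        overall.update(counts)
    overall_total = sum(overall.values())

    ratios = {}
    for name, counts in per_interval.items():
        total = sum(counts.values())
        ratios[name] = {
            kmer: (n / total) / (overall[kmer] / overall_total)
            for kmer, n in counts.items()
        }
    return ratios


# Toy example with k=2 so the numbers are easy to follow.
example = overrepresentation({"high": ["ACAC"], "unmapped": ["GGGG"]}, k=2)
```

A ratio well above 1 flags a k-mer that is enriched in that interval, for example a linker fragment concentrated among unmapped reads.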

Table 1.

Overview of SAMstat output

Reported statistics
Mapping rate^a
Read length distribution
Nucleotide composition
Mean base quality at each read position
Overrepresented 10mers
Overrepresented dinucleotides along read
Mismatch, insertion and deletion profile^a

^aOnly reported for SAM files.

3 RESULTS AND DISCUSSION

To demonstrate how SAMStat can be used to visualize mapping properties of a next generation sequencing dataset, we used data from a recently published transcriptome study (Plessy et al., 2010; DDBJ short read archive: DRA000169). We mapped all 24 million 5′ reads to the human genome (GRCh37/hg19 assembly) using BWA (Li and Durbin, 2009) with default parameters. SAMStat parsed the alignment information in ∼3 min, which is comparable to the 2 min it takes to copy the SAM file from one directory to another. The majority of reads can be mapped with very high confidence (Fig. 1a). When inspecting the mismatch error profiles, we noticed many mismatches involving a guanine residue at the very start of the reads (yellow bars in Fig. 1b–e). These 5′ added guanine residues are known to originate from the reverse transcriptase step in preparing the cDNAs (Carninci et al., 2006). When comparing the mismatch profiles for high-quality (Fig. 1b) and low-quality alignments (Fig. 1e), it is clear that a proportion of reads contain multiple 5′ added G's, which in turn pose a problem to the mapping. For example, at the lowest mapping quality (Fig. 1e), there are frequent mismatches involving G's at positions one and two and, to a lesser extent, up to position five, while in high-quality alignments the mismatches are confined to the first position of the reads (Fig. 1b).

Fig. 1.

A selection of SAMStat's html output. (a) Mapping statistics. More than half of the reads are mapped with a high mapping accuracy (red) while 9.9% of the reads remain unmapped (black). (b) Barcharts showing the distribution of mismatches and insertions along the read for alignments with the highest mapping accuracy [shown in red in (a)]. The colors indicate the mismatched nucleotides found in the read or the nucleotides inserted into the read. (c, d and e) Frequency of mismatches at the start of reads with mapping accuracies 1e−3 ≤ P < 1e−2, 1e−2 ≤ P < 0.5 and 0.5 ≤ P < 1, respectively (shown in orange, yellow and blue in panel a). The fraction of mismatches involving G's at position 2–5 increases. (f) Percentage of ‘GG’ dinucleotides at positions 1–5 in reads split up by mapping quality intervals. The background color highlights large percentages. The first and last row for nucleotides ‘GT’ and ‘GC’ are shown for comparison.
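The accuracy intervals in Fig. 1 are given as error probabilities P, whereas a SAM record stores a Phred-scaled mapping quality (MAPQ), with P = 10^(−MAPQ/10). The following Python sketch shows one way to assign an alignment to these intervals. It is an illustration only, not SAMStat's code: the function names and class labels are hypothetical, integer MAPQ cut-offs are used to avoid floating-point comparisons at the boundaries, and treating MAPQ 0 (P = 1) and MAPQ 255 (quality not available) as separate classes is an assumption, since the caption does not cover them.

```python
def mapq_to_error_probability(mapq):
    """Phred-scaled MAPQ -> probability that the mapping position is wrong."""
    return 10 ** (-mapq / 10)


def accuracy_interval(flag, mapq):
    """Assign one alignment to an accuracy class using integer MAPQ cut-offs."""
    if flag & 0x4:      # SAM FLAG bit 0x4: read is unmapped
        return "unmapped"
    if mapq == 255:     # SAM convention: mapping quality not available
        return "unavailable"
    if mapq > 30:       # P < 1e-3
        return "P < 1e-3"
    if mapq > 20:       # MAPQ 21..30
        return "1e-3 <= P < 1e-2"
    if mapq > 3:        # MAPQ 4..20, since 10^(-0.4) is about 0.40
        return "1e-2 <= P < 0.5"
    if mapq > 0:        # MAPQ 1..3, since 10^(-0.3) is about 0.50
        return "0.5 <= P < 1"
    return "P = 1"      # MAPQ 0 (assumed to be its own class)


def classify_sam_line(line):
    """Classify one tab-separated SAM alignment line (FLAG is column 2, MAPQ column 5)."""
    fields = line.rstrip("\n").split("\t")
    return accuracy_interval(int(fields[1]), int(fields[4]))


# Hypothetical records for illustration.
mapped = "r1\t0\tchr1\t100\t37\t4M\t*\t0\t0\tACGT\tIIII"
unmapped = "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII"
labels = [classify_sam_line(mapped), classify_sam_line(unmapped)]
```

Counting the labels over all records of a file would give mapping statistics of the kind shown in panel (a).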

Alongside the mismatch profiles, SAMStat gives a table listing the percentages of each dinucleotide at each position of the reads, split up by mapping quality intervals (Fig. 1f). For the present dataset, 60.4% of unmapped reads start with ‘GG’ and 10.1% contain a ‘GG’ at position 4. Evidently, 5′ G residues are added during library preparation and the start positions of mappings should be adjusted accordingly.
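A per-position dinucleotide table of this kind can be sketched in a few lines of Python. This is a simplified illustration, not SAMStat's implementation: it reports plain percentages rather than the overrepresentation measure of Frith et al. (2008), the function name `dinucleotide_percentages` is hypothetical, and the reads are assumed to be given in their original sequencing orientation (sequences of reverse-strand alignments in a SAM file are stored reverse-complemented and would need to be converted back first).

```python
from collections import Counter


def dinucleotide_percentages(reads, dinucleotide="GG", positions=5):
    """Percentage of reads carrying `dinucleotide` at each 1-based start position.

    Reads too short to contain a dinucleotide at a position are not counted
    for that position.
    """
    hits = Counter()
    eligible = Counter()
    for seq in reads:
        seq = seq.upper()
        for pos in range(1, positions + 1):
            if len(seq) >= pos + 1:
                eligible[pos] += 1
                if seq[pos - 1:pos + 1] == dinucleotide:
                    hits[pos] += 1
    return {
        pos: (100.0 * hits[pos] / eligible[pos]) if eligible[pos] else 0.0
        for pos in range(1, positions + 1)
    }


# Toy reads; in practice one list per mapping quality interval would be used.
toy_reads = ["GGACGT", "GGGTTT", "ACGGTA", "TTTTTT"]
table_row = dinucleotide_percentages(toy_reads)
```

Running the function once per mapping quality interval yields one row per interval, which makes an excess of leading ‘GG’ among unmapped reads easy to spot.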

SAMStat is ideally suited to deal with the ever increasing amounts of data from second- and third-generation sequencing projects. Specific applications include the verification and quality control of processing pipelines, the tracking of data quality over time and the visualization of data properties derived from new protocols and approaches, which in turn often leads to novel insights.

ACKNOWLEDGEMENT

We thank the reviewers for constructive suggestions.

Funding: Research Grant for the RIKEN Omics Science Center from the Ministry of Education, Culture, Sports, Science and Technology of the Japanese Government (MEXT) to Y.H.

Conflict of Interest: none declared.

REFERENCES

Carninci et al., Genome-wide analysis of mammalian promoter architecture and evolution, Nat. Genet., 2006 (pg. 626–635)

Cock et al., The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants, Nucleic Acids Res., 2009 (pg. 1767–1771)

Frith et al., A code for transcription initiation in mammalian genomes, Genome Res., 2008

Li and Durbin, Fast and Accurate Short Read Alignment with Burrows-Wheeler Transform, Bioinformatics, 2009 (pg. 1754–1760)

Li et al., Mapping short DNA sequencing reads and calling variants using mapping quality scores, Genome Res., 2008 (pg. 1851–1858)

Li et al., The Sequence Alignment/Map format and SAMtools, Bioinformatics, 2009 (pg. 2078–2079)

Plessy et al., Linking promoters to functional transcripts in small samples with nanoCAGE and CAGEscan, Nat. Methods, 2010 (pg. 528–534)

Trapnell and Salzberg, How to map billions of short reads onto genomes, Nat. Biotechnol., 2009 (pg. 455–457)

Author notes

Associate Editor: Martin Bishop

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
