Comprehensive fundamental somatic variant calling and quality management strategies for human cancer genomes

Raw data analysis tools

Tool feature	Function	(i) Quality control tools			(ii) Pre-process tools							(iii) All-in-one tools
Tool feature	Software	FastQC	MultiQC	NGS-Bits (ReadQC)	FASTX-Toolkit	Cutadapt	TrimGalore	Trimmomatic	Skewer	PEAT	SOAPnuke	PRINSEQ	NGS QC Toolkit	FaQCs	AfterQC	Fastp
Publication feature	Citations/year	384.20	124.25	0.67	22.30	737.56	37.13	2148.50	64.50	4.20	26.67	270.00	178.13	11.83	16.00	86.00
Installation	Installation	B/F	B/C/P	B/C/F	B/F/C	B/P	B/F	B/F	B/C/F	C/F	C	B/F	C	B/C	B/C/F	B/C/F
Performance	Time (hh:mm:ss)	00:03:11	00:00:06	00:09:17	NA	00:10:29	00:05:56	01:03:22	00:22:55	00:59:35	00:07:08	08:23:20	01:04:09	01:09:17	04:02:50	00:07:03
Report feature	Format	html; text	html; text; json	qcML, tsv	text	stdout	text	text	stdout	text	text	html	html; text	pdf; text	html; json	html; pdf; json
	Visualisation	Before	Post	Post	Post	χ	χ	χ	χ	χ	χ	Post	Post	Post	Both	Both
	Interactive	χ	✓	χ	χ	χ	χ	χ	χ	χ	χ	χ	χ	χ	✓	✓
Development feature	Multi-threading	✓	χ	χ	χ	✓	✓	✓	✓	✓	✓	χ	✓	✓	✓^*	✓
Development feature	Multi-sample	✓	✓	χ	χ	χ	χ	χ	χ	χ	χ	χ	χ	✓	✓	χ
Quality profile	Duplicated reads	✓	χ	χ	χ	χ	χ	χ	χ	χ	✓	✓	χ	✓	χ	✓
	Overlapping analysis	χ	χ	χ	χ	χ	χ	χ	χ	χ	χ	χ	χ	χ	✓	✓
	Sequencing error profiling	χ	χ	χ	χ	χ	χ	χ	χ	χ	χ	χ	χ	χ	✓	✓
Processing function	PE or SE	SE	χ	✓	SE	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓
	Detect adapter automatically	χ	χ	χ	χ	χ	χ	χ	χ	✓	χ	χ	✓	χ	✓	✓
	Cut adapter	χ	χ	χ	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓
	Convert quality	χ	χ	χ	χ	✓	✓	✓	✓	✓	✓	✓	✓	✓	χ	✓
	Quality filtering	χ	χ	χ	✓	χ	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓
	Trimming 3′ or 5′	χ	χ	χ	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓
	Sequence complexity	χ	χ	χ	χ	χ	χ	χ	χ	χ	χ	✓	χ	✓	χ	✓
	UMI	χ	χ	χ	χ	χ	χ	χ	χ	χ	χ	χ	χ	χ	χ	✓
	Compressed file	✓	✓	✓=	✓≈	✓	✓	✓	✓	✓	✓	χ	✓	✓=	✓	✓

✓, software has this function; χ, software does not have this function; ^*, software has this function, but no parameters were provided; #, SE; =, only input; ≈, only output.

B, install with BioConda or running Docker containers built by BioContainers; C, install by compiling from source; D, Docker container independent of BioContainer; F, free from compile by providing a pre-compiled binary file or script of Perl; P, install with pip.

‘Interactive’ row indicates reports that include interactive figures that can respond to user events like mouse hover.

‘Time’ was the average running time of processing 3 WES samples of Illumina paired-end sequencing.

Table 1

QC tools

Dedicated QC tools such as FastQC and ReadQC (a module in NGS-Bits) are excellent for highlighting potential problems in sequence data [8, 9]. FastQC illustrates QC metrics exhaustively and efficiently by producing graphs and tables in an HTML-based permanent report which includes read and base counts, read length, base composition distribution, GC content distribution and quality score curve statistics. In addition, FastQC checks the quality score encoding scheme and provides warnings and error thresholds that can be defined by the user. Conversely, the ReadQC report is relatively simple, consisting of read and base counts, read length, Q20, Q30 and two plots in qcML format. FastQC was faster than ReadQC in the testing using WES data, owing to the down-sampling strategy in FastQC.

Identifying sequencing quality issues by checking the reports of each sample in a cohort study can be inconvenient. MultiQC solves this problem by creating a single report that flexibly integrates quality reports from various QC tools and sample sets, allowing overall trends and deviations in large-scale sample studies to be identified rapidly and efficiently [10]. Therefore, we recommend utilising MultiQC in cohort studies to obtain insights into the quality of all samples and to help identify batch effects.

Pre-processing tools

Pre-processing tools manipulate FASTQ files by trimming adapters, filtering low-quality sequences, correcting base errors, etc., to prepare the files for alignment. The most significant of these features are adapter detection and trimming because adapter sequences will be read when the DNA fragment is shorter than the specified sequencing length, resulting in reads with unfavourable adapter sequences, known as adapter contamination. FASTX-Toolkit, Cutadapt, TrimGalore, Trimmomatic, Skewer and SOAPnuke are all common examples of pre-processing software designed to recognise and remove adapters based on sequence-matching algorithms that require a series of adapter sequences [11–15]. Conversely, paired-end adapter trimmer (PEAT) detects adapter sequences with indefinite length and a higher detection accuracy by using an overlap analysis algorithm that exploits overlapping fragments of paired forward and reverse reads [16]. Compared with sequence matching-based algorithms, overlap analysis-based algorithms have the distinct advantage of being able to detect and remove shorter adapters. Because the overlap analysis-based method does not require prior adapter sequences, it is particularly useful for analysing large-scale cancer genomics data from multiple different technical platforms that may use different adapters.
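
The overlap principle can be illustrated with a minimal Python sketch. It assumes that both reads have the same length and that the sequenced fragment is shorter than the read; the function names, the `min_overlap` and `max_mismatch_rate` parameters and the toy sequences are illustrative assumptions and do not reproduce the PEAT implementation.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_insert_length(read1, read2, min_overlap=10, max_mismatch_rate=0.1):
    """Infer the fragment length when it is shorter than the read length.

    If the fragment has length k, the first k bases of read1 equal the last
    k bases of the reverse complement of read2. Returns None if no such k
    is found (no adapter read-through detected).
    """
    rc2 = reverse_complement(read2)
    longest = min(len(read1), len(rc2))
    # Try the longest candidate fragment first; k == read length means no adapter.
    for k in range(longest - 1, min_overlap - 1, -1):
        head = read1[:k]
        tail = rc2[len(rc2) - k:]
        mismatches = sum(a != b for a, b in zip(head, tail))
        if mismatches <= max_mismatch_rate * k:
            return k
    return None


def trim_by_overlap(read1, read2, **options):
    """Remove adapter read-through from both reads without knowing the adapter."""
    k = find_insert_length(read1, read2, **options)
    if k is None:
        return read1, read2
    return read1[:k], read2[:k]


# Toy example (hypothetical sequences): a 20-bp fragment sequenced with 25-bp reads.
fragment = "ACGTTGCAAGGCTTACGATC"
read1 = fragment + "AGATC"                       # 5 bases of adapter read-through
read2 = reverse_complement(fragment) + "AGATC"   # same on the reverse read
trimmed1, trimmed2 = trim_by_overlap(read1, read2)
```

Because the adapter is defined only by what lies beyond the overlap, no adapter sequence is supplied. A production tool would additionally have to guard against spurious overlaps in low-complexity or repetitive reads.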

Low-quality sequences should also be discarded from FASTQ files to ensure reliable interpretation; therefore, careful attention should be paid to the quality encoding scheme. For instance, Phred33 stores each quality score as the ASCII character whose code is the score plus 33 (so the lowest character is ‘!’, ASCII 33) and is used by Sanger and Illumina 1.8+ sequencers, whereas Phred64 uses an offset of 64 (lowest character ‘@’, ASCII 64; the older Solexa variant extends down to ASCII 59) and was used by Illumina pipelines before CASAVA 1.8 [17]. FastQC automatically detects and reports a sequencer version, which can be used to infer ASCII quality encoding [8]. Most raw sequence data pre-processing tools support both quality encoding schemes and can convert from Phred64 to Phred33, except for the FASTX-Toolkit, which expects Phred64 FASTQ files by default. In addition, Trimmomatic can convert from Phred33 to Phred64 and can automatically check the quality format if the encoding scheme has not been specified by the user.

To provide further insight, we tested these tools with three WES samples of Illumina paired-end sequencing and compared additional practical features, including installation, report format and support for paired-end reads (Table 1). This summary table should help practitioners compare the tools and select the one most suitable for a given application; for instance, SOAPnuke can be scaled to multiple working nodes for parallel computing using a Hadoop MapReduce framework to cope with a vast volume of sequencing data [14]. Further, researchers who intend to view the quality trend of multiple samples in a large-scale study can quickly locate ‘MultiQC’ through the ‘Multi-sample’ row under ‘Development feature’; if data originate from different sequencers or the adapters of sequencing batches are unknown, the ‘Detect adapter automatically’ row under ‘Processing function’ points to ‘PEAT’, ‘NGS QC Toolkit’, ‘AfterQC’ and ‘Fastp’.
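
As a rough illustration of how the two encodings differ, the following Python sketch guesses the offset from the lowest quality character and converts Phred64 strings to Phred33. The function names are hypothetical, and the detection rule is a simple heuristic assumed here for illustration: a Phred33 file that happens to contain only high-quality bases would be misclassified, so real tools inspect many reads.

```python
def guess_phred_offset(quality_strings):
    """Guess the ASCII offset (33 or 64) from the lowest quality character seen."""
    lowest = min(ord(ch) for qual in quality_strings for ch in qual)
    # Characters below ASCII 59 cannot occur in Phred64 (or Solexa+64) data.
    return 33 if lowest < 59 else 64


def decode_quality(quality_string, offset):
    """Convert a FASTQ quality string into a list of integer quality scores."""
    return [ord(ch) - offset for ch in quality_string]


def phred64_to_phred33(quality_string):
    """Re-encode a Phred64 quality string as Phred33 (shift each code by 64 - 33)."""
    return "".join(chr(ord(ch) - 31) for ch in quality_string)
```

For example, the Phred64 string `hhB` corresponds to quality scores 40, 40 and 2, and is re-encoded as `II#` in Phred33.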

All-in-one tools

Neither QC tools nor pre-processing tools alone can address all the issues in raw sequence data; therefore, the conventional approach is to use FastQC for QC, Cutadapt to remove adapters and Trimmomatic to filter reads in a ‘QC, pre-process, QC (optional)’ workflow. TrimGalore integrates both pre-processing and QC functions by internally using Cutadapt to perform adapter trimming and quality filtering and then FastQC on the resulting files to provide a quality report [15]. Notably, TrimGalore runs faster than Cutadapt on paired-end data by processing each read file individually in single-end mode and then performing a ‘validation’ run on both trimmed files. All-in-one tools go further by performing QC, adapter trimming, quality filtering and many other processes in a single FASTQ file scan, which significantly simplifies raw data processing.

PRINSEQ was the first all-in-one processing tool available as a command-line program or web interface [18]. However, its lack of support for multithreading and compressed formats makes it very time-consuming when completing statistical reports for large datasets, which dramatically limits its use in cancer genomics research. While the NGS QC Toolkit supports both Illumina and Roche 454 platforms and generates an HTML-based report after processing, it runs slowly because its parallel mode uses one thread for each separate input file [19]. Fastp and AfterQC both produce interactive reports before and after pre-processing, thereby enabling users to hover over points or zoom in on figures for direct scrutiny. Moreover, both automatically detect and remove adapters using an overlap analysis algorithm [20, 21]. Additionally, both Fastp and AfterQC evaluate sequencing errors in paired-end data and correct possible wrong bases, which is essential for low-frequency variant calling. Unfortunately, AfterQC runs relatively slowly and is therefore overly time-consuming when processing large FASTQ files, such as high-depth sequence data [20]. Conversely, Fastp has comprehensive functions, can carry out these manipulations simultaneously and is thus suitable for automated raw data processing in the routine analysis of cancer genomics data.

In summary, although QC-only tools are highly efficient for reporting quality metrics, all-in-one tools outperform separate QC tools, pre-processing tools or combinations of the two by integrating functionalities to address more problems without sacrificing efficiency. As shown in Figure 1, we therefore recommend FastQC for producing single-sample quality reports, MultiQC for producing the overall statistics of a set of samples and all-in-one tools for the QC and pre-processing of raw sequence data owing to their efficient operation and functionalities.

Sequence alignment and QC

A fundamental part of cancer genome sequencing is the alignment or mapping of short DNA reads to a human reference genome. The main difficulty of this process is mapping each short read to a large reference genome while distinguishing between true variants and sequencing errors. To overcome this challenge and handle growing quantities of short DNA sequence reads, many alignment tools have been proposed since 2008. Fonseca et al. maintained an up-to-date compendium of over 80 aligners, of which at least 40 can be used to align short DNA sequences [22].

Cancer genome sequencing is primarily performed with paired-end short reads. When selecting a mapper, it is essential to consider the input data characteristics that it was designed for. For instance, mappers that exploit sequencing quality scores (such as Bowtie2 and BWA-MEM) can reduce alignment errors by assigning weaker penalties to mismatches at positions with low quality scores, and support for paired-end reads helps to detect alignment errors as well as improve sensitivity and specificity [22].

The different output characteristics of each mapper should also be taken into consideration because mapping quality scores and their thresholds vary between mappers. BWA-MEM assigns each read a mapping quality score between 0 and 60 to reflect the accuracy of its mapping; a higher score represents a more accurate mapping [23]. In contrast, Bowtie and Bowtie2 use different mapping quality scales (0–42 in Bowtie2) that relate to ‘mapping uniqueness’; a higher mapping quality is assigned to more unique alignments [24, 25].
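
For orientation, the SAM specification defines mapping quality as a Phred-scaled probability that the mapping position is wrong, with 255 indicating that the value is unavailable. The short Python sketch below (function name hypothetical) converts a score into that probability; because aligners populate the field differently, as noted above, the result should be read as nominal rather than calibrated across mappers.

```python
def mapq_to_error_probability(mapq):
    """Nominal probability that a read is mapped to the wrong position.

    Follows the SAM definition MAPQ = -10 * log10(P(wrong position)).
    A value of 255 means that the mapping quality is not available.
    """
    if mapq == 255:
        return None
    return 10 ** (-mapq / 10)
```

Under this definition, a threshold of 20 corresponds to a nominal error probability of 1 in 100 and a threshold of 30 to 1 in 1000.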

After the original alignment of short reads, several recalibration steps are carried out, including the sorting of reads by alignment position, the marking or removal of duplicated reads, the remapping of reads around short insertions and deletions (indels) and the recalibration of read quality. These steps are designed to make the alignment more accurate and prepare for the variant calling step; however, if misused, more false positives will be introduced.

A notable example of a recalibration step is local realignment. Indels are common and functionally essential types of sequence polymorphism. Unlike single-nucleotide polymorphisms (SNPs), indels make sequence mapping more complicated, resulting in alignment artefacts that may lead to a high false-positive rate when detecting single-nucleotide variations (SNVs) around indels. Performing local realignment before variant detection significantly improves read alignments, reduces the false-positive rate, provides enhanced performance for detection of indels and results in higher accuracy in the calculation of variant allele frequency (VAF) [26, 27]. DePristo et al. benchmarked the effects of indel realignment and described the mathematics behind the algorithms [3].

To reduce the false-positive rate in SNV detection and to improve the accuracy of indel detection, local realignment of indel regions is commonly performed using tools such as GATK, SRMA and ABRA (Figure 1A) [26–28]. An alternative strategy is the new variant caller VarDict, which internally triggers local realignment in indel regions and simultaneously identifies mismatches as evidence of indels, thereby improving the accuracy of indel detection [29]. In addition, the ‘mpileup’ command in SAMtools computes a per-base alignment quality (BAQ) score, which is the Phred-scaled probability of a read base being misaligned (BAQ is applied by default and is disabled with the ‘-B’ option), to help reduce the number of false SNVs caused by mismatches around indels [30].

It is essential to bear in mind that performing local realignment repeatedly may cause over-correction. In a work by Guo et al., the consecutive application of BAQ and GATK’s local realignment functions did not reduce the number of false-positive SNPs but, instead, introduced additional false positives [31]. The GATK team has eliminated realignment in the Mutect2 and HaplotypeCaller pipelines for GATK 3.7 and higher and has adopted a strategy of read reassembly in active regions covering indels in its new pipeline (Figure 1B). Similarly, we do not recommend using callers with a realignment function for input files that have already been processed using a separate local realignment tool (Figure 1C).

Cancer genome sequencing mainly involves WGS, WES and gene-panel sequencing for specific genes associated with cancer risk. Thus, it is important to determine the different QC parameters for these types of sequencing data.

The main QC parameter for WGS data is average depth (also known as the sequencing coverage), which is required for the accurate and reliable calling of variants and the estimation of VAF. However, average depth is easily skewed by a short region with a significantly higher depth, which can be quickly identified from depth distribution with an extended tail if regions of extremely high depth exist. Thus, it is more meaningful to exclude extremely high-depth regions when calculating average depth.
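
The effect of a few extremely deep regions on the average can be shown with a small Python sketch. The function name and the choice of discarding the top 1% of positions are assumptions made for illustration; the text does not prescribe a specific cut-off.

```python
def trimmed_mean_depth(depths, keep_percent=99):
    """Mean depth after discarding the deepest (100 - keep_percent)% of positions."""
    ordered = sorted(depths)
    # Integer arithmetic avoids floating-point surprises when computing the cut-off.
    n_kept = max(1, len(ordered) * keep_percent // 100)
    kept = ordered[:n_kept]
    return sum(kept) / len(kept)


# Hypothetical per-position depths: 99 positions at 30x and one collapsed repeat at 10000x.
depths = [30] * 99 + [10000]
plain_mean = sum(depths) / len(depths)    # about 129.7, dominated by the outlier
robust_mean = trimmed_mean_depth(depths)  # 30.0
```

A single outlying position inflates the plain average more than fourfold here, whereas the trimmed estimate reflects the depth of the bulk of the genome.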

Because most of the known pathogenic mutations are found in the coding regions, it is more practical to limit data sequencing and analysis to regions of exons or a set of genes relevant to a particular cancer [32]. Thus, WES and gene-panel sequencing are being incorporated into clinical practice because they are easier to manage in terms of cost and complexity of subsequent analysis. For capture-based target sequencing technology, it is important to consider capture efficiency. Clark et al. compared hybridisation-based WES capture kits from three vendors, namely, Nextera and TruSeq from Illumina Inc, SureSelect from Agilent Technologies and SeqCap EZ from Roche NimbleGen, and found at least 90% capture efficiency for all three vendors [33]. The calculation of capture efficiency is more meaningful when ignoring bases with low sequencing/mapping quality or low depth, as these bases do not provide sufficient evidence for variant calling. Hence, Bamdst and the Picard module CollectHsMetrics estimate the efficiency after filtering low-quality reads [34]. Low capture efficiency can indicate low library complexity, suboptimal probe hybridisation conditions or insufficiently stringent washes.
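
A simplified calculation of capture efficiency with quality filtering might look as follows in Python. The `AlignedRead` record, the `min_mapq` threshold of 20 and the definition used (on-target bases divided by aligned bases among reads that pass the filter) are assumptions for illustration; Bamdst and CollectHsMetrics each define their metrics in their own documentation.

```python
from dataclasses import dataclass


@dataclass
class AlignedRead:
    """Hypothetical per-read summary extracted from a BAM file."""
    mapq: int             # mapping quality of the alignment
    aligned_bases: int    # number of bases aligned to the reference
    on_target_bases: int  # aligned bases that fall inside the capture regions


def capture_efficiency(reads, min_mapq=20):
    """Fraction of aligned bases on target, ignoring low-mapping-quality reads."""
    kept = [read for read in reads if read.mapq >= min_mapq]
    total = sum(read.aligned_bases for read in kept)
    if total == 0:
        return 0.0
    return sum(read.on_target_bases for read in kept) / total


reads = [
    AlignedRead(mapq=60, aligned_bases=100, on_target_bases=100),
    AlignedRead(mapq=60, aligned_bases=100, on_target_bases=50),
    AlignedRead(mapq=0, aligned_bases=100, on_target_bases=0),  # filtered out
]
efficiency = capture_efficiency(reads)
```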

Sample matching is checked at the alignment stage to ensure that data from the same subject are correctly labelled. NGSCheckMate supports input files in FASTQ, BAM and variant call format (VCF) for sample matching verification. However, it is time-consuming for files in FASTQ format as the software needs to search for a k-mer sequence spanning specific SNPs to obtain read counts and calculate VAFs [35]. BAM-matcher is an alternative tool that provides a rapid pairwise comparison of BAM files by comparing the genotypes at pre-determined genomic locations [36]; however, the software does not support sample consistency examination for cohort studies, unlike NGSCheckMate. Alternatively, Somalier is designed to rapidly measure sample relatedness with BAM files by extracting informative genetic variation to a binary format called ‘sketch’ across 17 384 sites. It also supports reads aligned to genome build GRCh37 and GRCh38. However, understanding the HTML report of Somalier requires bioinformatics expertise [37]. Additional tools supporting the sample matching check include Peddy, seqCAT and HYSYS; however, they only check the matching with VCF files [38–40].
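
The idea of comparing genotypes at pre-determined positions can be sketched in a few lines of Python. The dictionary layout (site mapped to a genotype string) and the function name are hypothetical, and the sketch is not the algorithm of any of the tools named above; in practice, a decision threshold would have to be calibrated on known matched and unmatched pairs.

```python
def genotype_concordance(sample_a, sample_b):
    """Fraction of shared sites at which two samples carry the same genotype.

    Each sample is a dict mapping a site label (e.g. "chr1:100") to a genotype
    string (e.g. "0/1"). Returns None when the samples share no genotyped site.
    """
    shared = [site for site in sample_a if site in sample_b]
    if not shared:
        return None
    matches = sum(sample_a[site] == sample_b[site] for site in shared)
    return matches / len(shared)


# Hypothetical genotypes for a tumour and its presumed matched normal.
tumour = {"chr1:100": "0/1", "chr2:200": "1/1", "chr3:300": "0/0"}
normal = {"chr1:100": "0/1", "chr2:200": "1/1", "chr3:300": "0/1", "chr4:400": "0/0"}
concordance = genotype_concordance(tumour, normal)
```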

Several tools are used to perform QC and manipulate alignment data. For instance, SAMtools provides universal tools for processing mapping data, such as sorting (sort), indexing (index), depth statistics (depth) and alignment viewing (view) [23], whereas Qualimap is a platform-independent application that examines alignment data and provides a holistic view of the alignments to help identify mapping bias [41]. Additionally, MapQC in NGS-Bits inspects mapping information and presents the results in figures and tables [9]. Picard outperforms SAMtools when sorting, marking or removing duplicates from aligned sequences and can be used as an integrated part of GATK in addition to being a stand-alone application.

In summary, the alignment process is an important step in cancer genome sequencing. Although many programs map short DNA sequencing reads to a reference genome, we recommend BWA (BWA-MEM) as it works well in conjunction with downstream analysis tools. For example, GATK integrates BWA into its pipelines and calls variants based on BWA mapping scores. Moreover, BWA has been adopted in many large studies, such as TCGA and Pan-Cancer Analysis of Whole Genomes (PCAWG). We then recommend Picard to sort reads and remove duplicates. Because some callers perform local realignment internally during variant identification, researchers should decide whether and how to conduct a separate local realignment and then complete base quality recalibration using GATK. Qualimap can also be used to provide detailed alignment quality statistics, while sample consistency, which is essential in cancer genomics research, should be checked with NGSCheckMate in tumour-normal pairwise cohort studies or with Somalier in cancer or germline studies. All of these recommendations concerning the alignment stage are provided in Figure 1.

Variant detection and filtering

The most common application of cancer genome sequencing in clinical settings is the detection of short and structural variants with respect to a human reference genome, which has greatly stimulated the development of variant detection software [29, 42–50]. To improve the accuracy of detection, many callers also provide a corresponding filtering strategy; however, callers usually differ in their detection algorithms, filtering methods and thus output, particularly when handling sequence data with various characteristics (e.g. sequencing platform, capture regions, depth).

To date, a considerable number of studies have been published comparing the sensitivity, accuracy and performance of different variant detection software for WGS data, WES data and gene-panel sequencing data. For instance, Sandmann et al. evaluated eight tools (VarScan, GATK HaplotypeCaller, Platypus, LoFreq, SAMtools, FreeBayes, SNVer and VarDict) using both real and simulated non-matched datasets [51], whereas Krøigård et al. focused on five real tumour-normal paired samples and nine callers (EBCall, MuTect2, Seurat, Shimmer, Indelocator, Somatic Sniper, Strelka, VarScan2 and Virmid) [52]. The sensitivity of callers to sequencing depth has been investigated for both WES and targeted deep sequencing data; Xu et al. compared the performance of five callers (GATK UnifiedGenotyper, MuTect2, Strelka, SomaticSniper and VarScan2) on SNV detection using amplicon and WES data [53]. However, the versions of the tools evaluated in these studies may not be the most recent, and some callers only detect one variant signal (SNVs or indels). Further, there is a lack of ground truth for some real datasets.

Here, we limited our comparison to four continuously updated and commonly used tools that can detect both SNVs and indels in one run, namely, VarDict (version 1.5.7 in Java), Mutect2 (GATK 4.1.1), VarScan2 (version 2.4.3) and Strelka2 (version 2.9.10). We used four tumour-normal samples with ground truth and standard baseline filtering (VAF > 0.01; depth of at least 500× in both tumour and normal data) provided by the National Center for Clinical Laboratories in China (NCCL). The samples were sequenced on an Illumina HiSeq 2500 platform with a target region of 1.76 million bases and an average sequencing depth of 2700×. The test results are shown in Figure 2 (A, tumour-normal matched samples; B, tumour-only samples; C, agreement of callers) and indicate that VarDict performs best in terms of precision and recall for tumour-normal matched samples.
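
The baseline filter and the two performance measures can be expressed as a short Python sketch. The record layout (a dict keyed by chromosome, position, reference and alternate allele) and the function names are assumptions for illustration, not the evaluation scripts used in this comparison; the toy calls are invented.

```python
def baseline_filter(calls, min_vaf=0.01, min_depth=500):
    """Keep variant keys that pass the baseline VAF and depth thresholds.

    The text states VAF > 0.01 and the figure legend VAF >= 0.01; the
    inclusive form is assumed here.
    """
    return {
        key
        for key, call in calls.items()
        if call["vaf"] >= min_vaf
        and call["tumour_depth"] >= min_depth
        and call["normal_depth"] >= min_depth
    }


def precision_recall(called, truth):
    """Precision and recall of a set of called variants against a truth set."""
    true_positives = len(called & truth)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical calls keyed by (chromosome, position, reference, alternate).
calls = {
    ("chr1", 100, "A", "T"): {"vaf": 0.05, "tumour_depth": 2000, "normal_depth": 1800},
    ("chr1", 200, "G", "C"): {"vaf": 0.02, "tumour_depth": 2500, "normal_depth": 2100},
    ("chr2", 300, "C", "T"): {"vaf": 0.30, "tumour_depth": 400, "normal_depth": 900},
}
truth = {("chr1", 100, "A", "T"), ("chr2", 300, "C", "T")}
precision, recall = precision_recall(baseline_filter(calls), truth)
```

In this toy example the filter removes a true variant in a low-depth region, which illustrates how depth thresholds trade recall for precision.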

Figure 2

Performance of variant calling tools. (A) Performance of tools (VarDict, Mutect2, VarScan2 and Strelka2) on short somatic variant calling with tumour-normal samples (gene-panel sequencing samples B1701-B17NC, B1702-B17NC, B1703-B17NC, B1704-B17NC). (B) Performance of tools (VarDict and Mutect2) on short somatic variant calling with tumour-only samples (gene-panel sequencing samples B1701, B1702, B1703, B1704). The size of the circle correlates with precision: a larger circle corresponds to higher precision of the tool. The colour of the circle represents the recall rate: a darker colour (pink in tumour-normal samples; blue in tumour-only samples) means a higher recall rate. (C) Pairwise comparisons between VarDict, Mutect2, VarScan2 and Strelka2 for the gene-panel sequencing of B1701. The matrix depicts agreement between the variant callers. The numbers in the first four columns were calculated from variants without baseline filtering (Raw), while the numbers in the last four columns were calculated from baseline filtered variants (VAF ≥ 0.01, depth ≥ 500). In each row, the number reflects the fraction of calls identified by the caller that were also reported by the other caller. For instance, 0.6554 in the first row indicates that Mutect2 reports 65.54% of the calls reported by VarDict. The colour reflects the degree of agreement: greater colour intensity indicates higher agreement between the two callers.
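
The agreement matrix in panel C can be reproduced in principle with the following Python sketch, in which each row gives the fraction of a caller's variants that are also reported by each other caller. The function name and the toy call sets are hypothetical.

```python
def agreement_matrix(call_sets):
    """Pairwise agreement between callers.

    `call_sets` maps a caller name to its set of variant calls. The entry
    matrix[a][b] is the fraction of calls made by `a` that `b` also reports,
    so the matrix is generally not symmetric.
    """
    matrix = {}
    for row_caller, row_calls in call_sets.items():
        matrix[row_caller] = {}
        for col_caller, col_calls in call_sets.items():
            if row_calls:
                fraction = len(row_calls & col_calls) / len(row_calls)
            else:
                fraction = 0.0
            matrix[row_caller][col_caller] = fraction
    return matrix


# Toy call sets; integers stand in for variant identifiers.
toy_calls = {"CallerA": {1, 2, 3, 4}, "CallerB": {3, 4, 5}}
toy_matrix = agreement_matrix(toy_calls)
```

Here `toy_matrix["CallerA"]["CallerB"]` is 0.5, whereas `toy_matrix["CallerB"]["CallerA"]` is two thirds, which shows why the matrix must be read row by row.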

We further reviewed the relevant published literature evaluating variant detection tools in cancer genomes and analysed the results for the four tools we examined. Firstly, overall inter-caller agreement is low when no variant screening criteria are applied. Sandmann et al. showed that most variants could only be identified by one tool, with results supported by two callers accounting for only around 10% of the results supported by one caller, suggesting that each tool yields many false positives. The agreement is even lower for indels [51]. Secondly, the sensitivity of each caller varies with sequencing depth. SNV detection by Shimmer and Somatic Sniper is greatly affected by depth; EBcall, Virmid, MuTect2 and Strelka are relatively conservative; and Seurat and Indelocator are more sensitive for indels than EBcall and Strelka. Increased depth correlates with the consistency of each caller; thus, depth can improve the accuracy of software to some extent [52]. Thirdly, detection limits differ greatly between callers. LoFreq, FreeBayes, VarDict and SNVer can identify variants with a VAF of ≤5%; MuTect2, 5–10%; VarScan, 15–20%; and SAMtools, >20% [52]. Notably, the consistency of each caller increases with increasing detection limits. Fourthly, no tool except VarDict recognises complex variants, with most callers reporting multiple adjacent variants instead of multiple nucleotide variants (MNVs). For example, a true complex ‘Inframe indel’ variant reported in ClinVar, NC_000017.10:g.37880997delinsTTAT, which occurred in an exon of the ERBB2 gene, was only called by VarDict (Figure 3); however, variants occurring close to or at the same genome position (chr17:37880997) were detected by Mutect2 and VarScan2 as two individual variants (‘TTA’ insertion at chr17:37880996 and ‘G > T’ SNV at chr17:37880997) and were identified by Strelka2 as ‘G > T’ at chr17:37880997. These results show that VarDict can detect and represent complex mutations directly, whereas the calls of the other callers require further manual review.
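The pairwise agreement described for Figure 2C can be sketched in a few lines. This is a minimal illustration, assuming each caller's output has already been reduced to a set of (chromosome, position, reference allele, alternate allele) tuples; the function name `pairwise_agreement` and the toy call sets are hypothetical and are not data from this study.

```python
from typing import Dict, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def pairwise_agreement(calls: Dict[str, Set[Variant]]) -> Dict[str, Dict[str, float]]:
    """Fraction of each row caller's calls that the column caller also reports."""
    matrix: Dict[str, Dict[str, float]] = {}
    for row_name, row_calls in calls.items():
        matrix[row_name] = {}
        for col_name, col_calls in calls.items():
            if not row_calls:
                # A caller with no calls has no defined agreement; report 0.0.
                matrix[row_name][col_name] = 0.0
            else:
                shared = len(row_calls & col_calls)
                matrix[row_name][col_name] = shared / len(row_calls)
    return matrix


# Hypothetical toy call sets, for illustration only.
toy_calls = {
    "VarDict": {("chr17", 100, "G", "T"), ("chr17", 200, "A", "C"),
                ("chr1", 50, "C", "T"), ("chr2", 75, "T", "G")},
    "Mutect2": {("chr17", 100, "G", "T"), ("chr1", 50, "C", "T")},
}
agreement = pairwise_agreement(toy_calls)
print(agreement["VarDict"]["Mutect2"])  # share of VarDict calls also made by Mutect2
print(agreement["Mutect2"]["VarDict"])  # share of Mutect2 calls also made by VarDict
```

The matrix is not symmetric: a caller that reports many variants can cover most of a conservative caller's calls while the reverse fraction stays low.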

Figure 3

A complex multinucleotide variant at chr17:37880997 in the human reference genome hg19. (A) Integrative Genomics Viewer (IGV) screenshot of reads covering the variant. Each read is represented by a grey bar. The purple vertical lines on reads represent insertion events, and the different bases are coloured. For example, the red ‘T’ represents a single-nucleotide variant (SNV) from G to T. (B) Calls of VarDict, Mutect2, VarScan2 and Strelka2.

An effective strategy for improving the accuracy of variant detection is to integrate the variants called by multiple callers, for instance, adopting the union or intersection of results from various callers for SNVs and selecting indels that are called by at least two of three callers to ensure a higher accuracy [54]. Open-source callers based on this idea are already available, such as CAKE and GenomeVIP [4, 55].
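The integration rule described above can be sketched as follows. This is a minimal illustration rather than the method of CAKE or GenomeVIP: it assumes three callers whose calls are given as (chromosome, position, reference allele, alternate allele) tuples with indels already normalised to a common representation, and the names `consensus_calls`, `snv_mode` and `min_indel_support` are hypothetical.

```python
from collections import Counter
from typing import Dict, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def is_snv(variant: Variant) -> bool:
    """Treat single-base substitutions as SNVs; everything else as an indel."""
    _, _, ref, alt = variant
    return len(ref) == 1 and len(alt) == 1


def consensus_calls(calls: Dict[str, Set[Variant]],
                    snv_mode: str = "intersection",
                    min_indel_support: int = 2) -> Set[Variant]:
    """Union or intersection for SNVs; indels need min_indel_support callers."""
    if snv_mode not in ("union", "intersection"):
        raise ValueError("snv_mode must be 'union' or 'intersection'")
    # Each caller contributes a variant at most once, so the count is the
    # number of callers supporting that variant.
    support = Counter(v for caller_calls in calls.values() for v in caller_calls)
    n_callers = len(calls)
    selected: Set[Variant] = set()
    for variant, n_support in support.items():
        if is_snv(variant):
            needed = 1 if snv_mode == "union" else n_callers
        else:
            needed = min_indel_support
        if n_support >= needed:
            selected.add(variant)
    return selected


# Hypothetical toy calls from three callers, for illustration only.
snv1, snv2 = ("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")
indel1, indel2 = ("chr2", 300, "AT", "A"), ("chr2", 400, "G", "GTT")
toy_calls = {
    "caller_a": {snv1, snv2, indel1, indel2},
    "caller_b": {snv1, indel1},
    "caller_c": {snv1, snv2},
}
strict = consensus_calls(toy_calls, snv_mode="intersection")
lenient = consensus_calls(toy_calls, snv_mode="union")
print(sorted(strict))
print(sorted(lenient))
```

Choosing the intersection favours precision, whereas the union favours recall; the indel threshold of two callers follows the rule quoted in the text.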

Although filtering variants cannot remedy false negatives, some strategies can reduce the number of false positives. Most variant callers have corresponding filtering methods, but not all methods are suitable for every study. An example is GATK’s Variant Quality Score Recalibration (VQSR), which uses at least 5000 variants that coincide with a known mutation set (HapMap) to train a Gaussian mixture model and then distinguishes true variants from false positives [56]. VQSR works well on variants from WGS or WES data (with a region size of at least 50 M); however, anything smaller, such as data from gene-panel sequencing, may encounter difficulties or errors. Similarly, the best practice for somatic SNV and indel calling in GATK introduces a ‘panel of normals’ (PON), which requires blood samples from at least 40 healthy young people. Furthermore, the most critical selection criterion for the ‘normals’ is that they are technically similar to the tumours (same exome or genome preparation methods and sequencing technology). The rigorous standards of running VQSR and building a PON pose significant challenges to research in small- to mid-sized laboratories. Therefore, we recommend using multiple callers to detect variants and filtering false positives using a crossover method.

In this study, we also performed mutual verification between variant callers for tumour samples without matched controls (Figure 2B). This revealed a very high false-positive rate, indicating that variant detection in an individual tumour sample requires more work and exploration.

QC during variant calling focuses on the heterozygosity/homozygosity ratio, the transition/transversion (Ti/Tv) ratio and the mutation load, which help to characterise variants at the sample level. For WGS data, the heterozygosity/homozygosity ratio for variants in Hardy-Weinberg equilibrium is 2.0; a marked deviation from this value indicates errors such as mismatched tumour-normal samples [31]. The Ti/Tv ratio, which is approximately 3.0 for SNVs in exons and around 2.0 for non-exonic regions, is another indicator of overall SNV quality that differs by genomic region [57]. In addition, the number of mutations per million bases is an essential indicator of false-positive rates, with Alexandrov et al. showing that the prevalence of somatic mutations varies from 0.001 to over 400 per million bases between and within cancer categories with adequate samples [58].
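The three sample-level indicators above reduce to simple counts. The sketch below is a minimal illustration, assuming biallelic SNVs given as (reference, alternate) base pairs and diploid genotype strings in the VCF style (for example `0/1`); the function names are hypothetical and the toy values are not results from this study.

```python
from typing import Iterable, Tuple

# Purine-purine and pyrimidine-pyrimidine substitutions are transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ti_tv_ratio(snvs: Iterable[Tuple[str, str]]) -> float:
    """Transition/transversion ratio over (ref, alt) single-base substitutions."""
    ti = tv = 0
    for ref, alt in snvs:
        if (ref.upper(), alt.upper()) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else float("inf")


def het_hom_ratio(genotypes: Iterable[str]) -> float:
    """Heterozygous / non-reference homozygous ratio from diploid genotypes."""
    het = hom = 0
    for genotype in genotypes:
        alleles = genotype.replace("|", "/").split("/")
        if len(alleles) != 2 or "." in alleles:
            continue  # skip missing or non-diploid genotypes
        if alleles[0] != alleles[1]:
            het += 1
        elif alleles[0] != "0":
            hom += 1  # homozygous reference calls are not counted
    return het / hom if hom else float("inf")


def mutations_per_mb(n_mutations: int, covered_bases: int) -> float:
    """Mutation load expressed per million covered bases."""
    return n_mutations / (covered_bases / 1_000_000)


# Hypothetical toy inputs, for illustration only.
toy_snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
toy_genotypes = ["0/1", "0/1", "1/1", "0/0", "./.", "0|1", "1|1"]
print(ti_tv_ratio(toy_snvs))
print(het_hom_ratio(toy_genotypes))
print(mutations_per_mb(150, 50_000_000))
```

Values far from the expectations quoted in the text would then prompt a closer look at the sample rather than serve as a hard filter.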

Currently, only a few analysis tools focus comprehensively on variant calling QC for different applications and purposes; however, many variant callers provide modules for checking certain aspects. For example, a module named ‘Calculate Contamination’ in GATK 4.0 provides fractional contamination to inform downstream filtering, whereas SnpEff, QC3 and VariantQC in NGS-Bits are tools for profiling variant quality that provide the Ti/Tv and non-reference homozygosity ratios as well as cross-sample contamination checks [9, 59, 60]. Picard provides modules for processing VCF files, such as FixVcfHeader, MergeVcfs and SortVcf. In addition, NGSCheckMate can check sample consistency quickly using VCF files. However, VCF files must first be generated by variant callers; if sample mismatches are only found at the calling stage, the preceding processing steps, especially the multi-step alignment and variant calling that cost much time and storage, have been a waste of resources. Therefore, although FASTQ and VCF files can also be used to check sample consistency, using BAM files is a compromise strategy that saves time and finds mismatches as early as possible, avoiding wasted work.
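The idea behind a sample consistency check can be illustrated by comparing variant allele fractions (VAFs) at shared polymorphic sites between two samples. The sketch below is a simplified illustration of that idea only and is not the statistical model implemented by NGSCheckMate; the function name `vaf_correlation`, the site keys and the toy VAFs are hypothetical, and the VAFs are assumed to have been computed beforehand (for example from read counts in BAM files).

```python
import math
from typing import Dict, Tuple

Site = Tuple[str, int]  # (chrom, pos) of a known polymorphic site


def vaf_correlation(vaf_a: Dict[Site, float], vaf_b: Dict[Site, float]) -> float:
    """Pearson correlation of VAFs over the sites covered in both samples."""
    shared = sorted(set(vaf_a) & set(vaf_b))
    if len(shared) < 2:
        raise ValueError("need at least two shared sites")
    xs = [vaf_a[site] for site in shared]
    ys = [vaf_b[site] for site in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("VAFs are constant in at least one sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy VAFs at four germline SNP sites, for illustration only.
tumour = {("chr1", 10): 0.50, ("chr2", 20): 1.00, ("chr3", 30): 0.00, ("chr4", 40): 0.50}
matched_normal = {("chr1", 10): 0.48, ("chr2", 20): 0.99, ("chr3", 30): 0.01, ("chr4", 40): 0.52}
other_person = {("chr1", 10): 0.00, ("chr2", 20): 0.50, ("chr3", 30): 1.00, ("chr4", 40): 0.00}

r_matched = vaf_correlation(tumour, matched_normal)
r_other = vaf_correlation(tumour, other_person)
print(round(r_matched, 3), round(r_other, 3))
```

Samples from the same individual are expected to correlate strongly at germline sites, whereas unrelated samples should not; any decision threshold would need to be calibrated on real data.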

Variant annotation and visualisation

Variant annotation is the aggregation and summarisation of data relevant to a candidate genomic alteration and is a crucial step in cancer genome sequencing for the ultimate prediction of functional effects. Importantly, variant annotation describes variants from many angles, from basic information (such as gene symbols and functional classification) to disease-specific predictions and drug sensitivity derived from databases, in order to pinpoint a small subset of potential disease-causing mutations for further interpretation.

Genetic variant databases of large-scale populations, such as 1000 Genomes, gnomAD (The Genome Aggregation Database), ESP6500 (NHLBI GO Exome Sequencing Project v. 6500) and ExAC (Exome Aggregation Consortium), aggregate population allele frequencies, allowing investigators to distinguish somatic mutations from germline variations [61–64]. gnomAD contains WGS samples from 15 496 individuals and WES samples from 123 136 individuals and is currently recognised as the most significant genetic variant research database. Databases associated with genetic diseases include ClinVar, HGMD (The Human Gene Mutation Database) and OMIM (Online Mendelian Inheritance in Man). ClinVar is a freely accessible archive of relationships between human genome variations and phenotypes with evidence from expert reviews [65–67], whereas HGMD collates all published gene lesions related to human inherited disease from approximately 250 journals. OMIM contains mutations identified in published materials from research worldwide. COSMIC (Catalogue of Somatic Mutations in Cancer) contains variants related to human cancer and is currently the most comprehensive expert-curated global dataset for exploring the effect of somatic mutations on cancer. In addition, prediction tools such as SIFT, PolyPhen2, GERP++, CADD and REVEL are used to evaluate the pathogenicity of missense variants [68–72]. Table 2 summarises the commonly used databases and prediction tools that facilitate access to and communication about variation and disease.

Table 2

Databases and predictors used in annotations

Databases or software		Name	Public	Samples or variants	Type	URL
Frequency database	For variant frequency in WGS	1000 Genomes: The 1000 Genomes Project dataset [61]	Yes	~84.40 million variants from 2504 samples	WGS	http://www.internationalgenome.org/data
		Kaviar: Known VARiants [73]	Yes	~170.00 million variants from 77 781 samples	~132 000 WGS, ~644 000 WES	http://db.systemsbiology.net/kaviar/
		Haplotype Reference Consortium database [74]	Yes	~40.00 million variants from 32 000 samples	Panels	http://www.haplotype-reference-consortium.org/
		69 Genomes Data: Allele frequency in 69 human subjects	Yes	69 samples	WGS	https://www.completegenomics.com/public-data/69-genomes/
		gnomAD: The Genome Aggregation Database [62]	Yes	138 632 samples	15 496 WGS	https://macarthurlab.org/2017/02/27/the-genome-aggregation-database-gnomad/
	For variant frequency in WES	gnomAD: The Genome Aggregation Database (gnomAD) [62]	Yes	138 632 samples	123 136 WES	https://macarthurlab.org/2017/02/27/the-genome-aggregation-database-gnomad/
		ESP6500: NHLBI GO Exome Sequencing Project v. 6500 [63]	Yes	6503 samples	WES	http://evs.gs.washington.edu/EVS/
		ExAC: Exome Aggregation Consortium dataset [64]	Yes	60 706 samples	WES	http://exac.broadinstitute.org/
Functional prediction tools	For functional variant prediction in WGS	(Software) GERP++: Genomic Evolutionary Rate Profiling [70]	Yes	~9.00 million variants	WGS	http://mendel.stanford.edu/SidowLab/downloads/gerp/
		(Software) CADD: Combined Annotation Dependent Depletion [71]	Yes	~9.00 million variants	All	https://cadd.gs.washington.edu/
		(Software) DANN: A deep learning approach for annotating the pathogenicity of genetic variants [75]	Yes	—	All	https://dbpia.nl.go.kr/bioinformatics/article/31/5/761/2748191
		(Software) Fathmm: Functional Analysis Through Hidden Markov Models [76]	Yes	~9.00 million variants	Coding and noncoding	http://fathmm.biocompute.org.uk/
		(Software) EIGEN: A spectral approach integrating functional genomic annotations for coding and noncoding variants [77]	Yes	~9.00 million variants	Coding and noncoding	http://www.columbia.edu/~ii2135/eigen.html
		(Software) GWAVA: Genome Wide Annotation of Variants [78]	Yes	—	Noncoding	https://www.sanger.ac.uk/science/tools/gwava
	For functional variant prediction in WES	dbNSFP: An annotation database for assembled non-synonymous SNPs [79]	Yes	—	WES	https://sites.google.com/site/jpopgen/dbNSFP
	For functional splice variant prediction	scsnv: A likelihood score that a variant affects splicing [80]	Yes	—	Splicing	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4267638/
	For functional splice variant prediction	SPIDEX: Splicing Index [81]	Yes	—	Splicing	http://tools.genes.toronto.edu/
Disease database or tools	For disease-specific variants	ClinVar: Clinical Variants [65]	Yes	~0.43 million variants	Coding	https://www.clinicalgenome.org/data-sharing/clinvar/
		COSMIC: The Catalogue of Somatic Mutations in Cancer [82]	Yes	~3.00 million variants	Coding and noncoding	https://cancer.sanger.ac.uk/cosmic/
		ICGC: International Cancer Genome Consortium	Yes	~68.00 million variants	Coding	https://icgc.org/
	For variant identifiers	dbSNP: The NCBI database of genetic variation [83]	Yes	~60.00 million variants	WGS	https://www.ncbi.nlm.nih.gov/snp/
	For variant identifiers	OMIM: Online Mendelian Inheritance in Man [67]	No	~0.03 million variants	WGS	https://omim.org/
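One common use of the frequency databases listed above is to remove likely germline polymorphisms from a candidate somatic call set. The sketch below is a minimal illustration, assuming each variant has already been annotated with allele frequencies keyed by database name; the record layout, the function name `filter_common_polymorphisms` and the 0.01 cut-off are assumptions chosen for illustration, not recommendations from this review.

```python
from typing import Dict, List, Optional

AnnotatedVariant = Dict[str, object]


def filter_common_polymorphisms(variants: List[AnnotatedVariant],
                                max_population_af: float = 0.01) -> List[AnnotatedVariant]:
    """Keep variants that are absent from, or rare in, every annotated database."""
    kept = []
    for variant in variants:
        frequencies: Dict[str, Optional[float]] = variant["population_af"]  # type: ignore[assignment]
        # None means the variant was not found in that database.
        observed = [af for af in frequencies.values() if af is not None]
        if not observed or max(observed) < max_population_af:
            kept.append(variant)
    return kept


# Hypothetical annotated variants, for illustration only.
toy_variants = [
    {"id": "var_rare", "population_af": {"gnomAD": 0.0001, "1000 Genomes": None}},
    {"id": "var_common", "population_af": {"gnomAD": 0.12, "1000 Genomes": 0.10}},
    {"id": "var_novel", "population_af": {"gnomAD": None, "1000 Genomes": None}},
]
kept_ids = [v["id"] for v in filter_common_polymorphisms(toy_variants)]
print(kept_ids)
```

Taking the maximum frequency across databases is a conservative choice: a variant that is common in any one resource is treated as a probable polymorphism.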

Currently, there are four leading tools used to annotate DNA variants: Annovar, SnpEff, VEP (Variant Effect Predictor) and Oncotator (Table 3) [59, 84–86]. Annovar can flexibly annotate SNVs or SNPs, indels and CNVs according to continuously updated publicly available resources and user-contributed datasets. Also, Annovar annotates variants by gene-based, region-based and filter-based methods and can handle gene definitions from RefSeq, UCSC, Ensembl and GENCODE. Moreover, Annovar can accept variants in VCF as input and convert variant lists in VCF into a specific format using its internal scripts.

Table 3

Annotators for variant annotation

Annotator	Language	Human reference genome	Input format	Output format	Prepare input function	Download database function	Self-build database	Filtering function	Statistics filtering	Variant type	Web-interface
VEP	Perl	GRCh38(HG38)/GRCh37(hg19)	VCF	VCF	χ	✓	✓	✓	χ	SNVs;indels;CNVs;SVs	✓ (http://asia.ensembl.org/Tools/VEP)
Oncotator	Python	GRCh37(hg19)	VCF;TCGAMAF;MAFLITE	VCF;MAFLITE	χ	χ	✓	χ	χ	SNVs;indels	✓ (http://www.broadinstitute.org/oncotator)
SnpEff	Java	GRCh38/GRCh37(hg19)	VCF	VCF	✓	✓	χ	χ	✓	SNVs;indels	χ
Annovar	Perl	GRCh38/GRCh37(hg19)	VCF; Annovar input	VCF; Annovar output	✓	✓	✓	✓	χ	SNPs; indels;CNVs	✓

Meanwhile, Oncotator integrates data from 14 public resources to annotate SNVs and indels with data relevant to cancer studies and has been used in the Broad Institute’s Cancer Genome Analysis pipeline; thus, it is crucial for several large-scale projects conducted by TCGA, NHGRI and TARGET. Oncotator is also available as a Python module for local use and is accessible as a web service that can deal with multiple input formats, including VCF, TCGAMAF (mutation annotation format) and MAFLITE. However, because of the way it processes the ‘INFO’ field, the MAF files generated from the VCF files of various callers (such as VarDict and GATK) display different column titles, column numbers and column orders [87]. Thus, to ensure the consistency of the annotation results, variants in VCF need to be converted into the MAFLITE format, which requires additional custom scripts. Oncotator only supports the GRCh37 (hg19) human reference genome, and no update has been released for over 3 years.

SnpEff is a toolbox that performs genetic variant annotation and functional effect prediction, whereas SnpSift manipulates the variants annotated by SnpEff, thus filtering large genomic datasets and identifying the most significant variants. ClinEff is an innovative tool released as a professional version of the SnpEff and SnpSift suites and is more suitable for clinical uses and production operations [88, 89]. However, ClinEff suites only process VCF input files and produce VCF files with a new field, ‘ANN’, while requiring additional configuration before annotation which may require professional IT knowledge.

Lastly, VEP annotates SNPs, indels, CNVs and SVs (structural variants) by providing up to 32 variant classifications and the corresponding filtering or statistical scripts. VEP also reads both compressed (gzipped) and uncompressed input files of multiple formats, such as VCF and HGVS identifiers, and returns annotations in tab-delimited, VCF and JSON formats. Simultaneously, VEP generates an HTML file containing statistics on the annotation results, presenting a general report of command line parameters, running time and information about the number of variants, genes, transcripts and regulatory features overlapped by the input variants.

Owing to differences in algorithms, transcript sets and reporting strategies, the results of the various annotation tools differ significantly in their transcript and genome features. For instance, VEP reports all possible transcripts, whereas Annovar only returns the transcripts with the most severe consequence according to its internal precedence, as demonstrated by McCarthy et al. [90]. We therefore recommend using VEP to annotate tumour-normal paired cancer genome variants, because it offers a suite of filtering scripts and a statistical report, and then using Vcf2maf to convert the variants from VCF to TCGA MAF [91], a format required by many downstream tools, such as MuSiC2 for detecting significantly mutated genes, MutSigCV for identifying driver genes and Hotspot3D for identifying mutation hotspots in three-dimensional protein structures [92–94].
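The difference between reporting every transcript and reporting only the most severe consequence can be made concrete with a small sketch. It assumes that per-transcript annotations are available as (transcript, consequence) pairs; the short severity ranking below is an illustrative assumption and is neither the full precedence table of VEP nor that of Annovar, and the transcript identifiers are hypothetical.

```python
from typing import List, Tuple

# Illustrative ranking from most to least severe (an assumption for this
# sketch; real annotators define longer, tool-specific precedence lists).
SEVERITY_ORDER = [
    "stop_gained",
    "frameshift_variant",
    "missense_variant",
    "synonymous_variant",
    "intron_variant",
]

Annotation = Tuple[str, str]  # (transcript_id, consequence)


def most_severe(annotations: List[Annotation]) -> Annotation:
    """Return the annotation whose consequence ranks highest in SEVERITY_ORDER."""
    if not annotations:
        raise ValueError("no annotations supplied")
    rank = {term: index for index, term in enumerate(SEVERITY_ORDER)}
    # Unknown consequence terms are ranked after all known terms.
    return min(annotations, key=lambda item: rank.get(item[1], len(rank)))


# Hypothetical per-transcript annotations for one variant.
toy_annotations = [
    ("TX_A", "intron_variant"),
    ("TX_B", "missense_variant"),
    ("TX_C", "synonymous_variant"),
]
print(toy_annotations)               # all transcripts, as a VEP-style report
print(most_severe(toy_annotations))  # single record, as a precedence-based report
```

Collapsing to one record simplifies downstream tables but hides transcripts in which the same variant has a milder effect, which is one reason results differ between annotators.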

The visual inspection of aligned reads to validate and interpret candidate variants is an essential aspect of cancer genome sequencing. Visualisation tools display aligned reads with mapping characteristics (e.g. mapping quality, depth, strand) and variants with annotation information from various databases in a user-friendly, intuitive and responsive manner. We have divided the most prevalent visualisation tools into web servers and stand-alone applications. Ensembl Genome Browser, JBrowse Genome Browser, UCSC Genome Browser and VEGA Genome Browser are web servers that display the uploaded data in the context of online resources while avoiding the hassle of downloading and complex installation [95–98]. These servers can deal with user-defined tracks by either uploading a file or specifying a remote URL; however, they require additional data handling, such as indexing and sorting files into a fixed format, that may be difficult for inexperienced practitioners. Owing to limitations such as network bandwidth, cybersecurity and privacy issues with respect to uploading large sequence data, web-based visualisation tools are more suitable for temporarily viewing data in a small region of interest or exploring published datasets included in public databases. Conversely, Artemis, IGV (Integrative Genomics Viewer) and Savant (Sequence Annotation and Visualisation and Analysis Tool) are stand-alone visualisation tools that can flexibly display variants with interactive browsing and zooming features [99–101]. Although there is no need to access large-volume data remotely, these tools do require that annotation resources are downloaded and updated to allow meaningful visualisation and validation. In addition, large-volume data such as WGS samples demand computers or clusters equipped with large memory and storage. Thus, stand-alone visualisation tools are suitable for experienced analysts with an IT background viewing sequence data in large-scale clinical cancer sequencing.

Conclusion and discussion

The process of translating complex cancer genomic data into clinical practice involves a series of large and complicated steps, including foundational analysis to detect variants, advanced interpretation to reveal molecular profiles and multi-omics approaches to obtain biological insights. We limited this review to foundational analysis, an interdisciplinary area of bioinformatics and computer science, and conducted a comprehensive and systematic investigation of the tools, strategies and published literature regarding cancer genome sequencing, emphasising the importance of QC in each step. Although we investigated and compared many tools for cancer genome sequencing, other aspects, such as sample extraction, library preparation and sequencing processes, are of equal importance. Here, we limited the study to the analysis of sequencing data from Illumina sequencers. QC should be applied in each step of cancer genome sequencing data analysis, i.e. raw data pre-processing, alignment, variant calling and annotation, at different resolutions from the single-sample to the cohort level (Figure 1). The recommendations we propose should serve as valuable guidelines for both skilled and inexperienced practitioners in choosing the appropriate tools and optimal steps for specific applications, with the anticipation of promoting the development of precision medicine.

Open-source software, workflows and public resources have greatly boosted the mining of cancer genomes. Among the variety of tools available for each step of the analysis, those that can handle a diverse range of genomic data formats (FASTQ, BAM, SAM, VCF, MAF) are attracting increasing attention and becoming more widely used in routine analysis. This is illustrated by some of the tools discussed in this review; for instance, NGSCheckMate validates sample identity from FASTQ, BAM and VCF input files and is effective for different data types (WES, WGS and RNA-seq), whereas VarDict detects a wide range of variant signals including SNVs, indels, CNVs and SVs with a single scan of aligned sequences. Similar tools with versatile features include NGS-Bits, MultiQC and IGV.

Several other aspects can potentially influence cancer genome sequencing. Firstly, the entire foundational analysis of genomic data is a coherent process, wherein the results of each step directly affect the accuracy of subsequent procedures. Therefore, tool developers should provide not only the functions and features of their own tool but also compatibility with upstream tools. Secondly, test datasets (such as NA12878) with various characteristics, ground truth and filtering standards sequenced from real samples should be improved or released to provide uniform and objective evaluation criteria. Thirdly, large-scale studies should aim to eliminate imbalanced population diversity in public databases, as Landry et al. found significantly fewer studies in Asian (29%) and other populations (4%; e.g. African and Latin American) than in European populations (67%) [102]. Cancer genome sequencing relies on population-based resources to remove common genomic polymorphisms; therefore, population imbalance may adversely affect cancer investigations in underrepresented populations. It is thus necessary to establish genomic polymorphism databases for specific populations.

Key Points

Several popular pipelines and software packages for cancer genome sequencing data analysis are not suitable for every application and should be selected based on specific research needs.
The local realignment step is critical in the pipeline, but whether it should be performed depends on the variant caller chosen for the next step.
Quality control should be conducted in each step using appropriate software to ensure the accuracy of the analysis.

Funding

National Natural Science Foundation of China (grant number 31771466); National Key R&D Program of China (grant numbers 2018YFB0203903, 2016YFC0503607, and 2016YFB0200300); Special Project of Informatization of the Chinese Academy of Sciences, China (grant number XXH13504-08).

Conflict of interest

The authors declare that no conflicts of interest exist.

Xiaoyu He, Ph.D., Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences, is mainly engaged in cancer genomics research and construction of the Chinese cancer genome database.

Shanyu Chen, Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing 100190, China.

Ruilin Li, Computer Network Information Center, Chinese Academy of Sciences; Beijing 100190, China.

Xinyin Han, Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing 100190, China.

Zhipeng He, Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing 100190, China.

Danyang Yuan, Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing 100190, China.

Shuying Zhang, Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing 100190, China.

Xiaohong Duan, ChosenMed Technology (Beijing) Co., Ltd., Beijing 100176, China.

Beifang Niu, Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing 100190, China; ChosenMed Technology (Beijing) Co., Ltd., Beijing 100176, China.

References

1. Challis D, Yu J, Evani US, et al. An integrative variant analysis suite for whole exome next-generation sequencing data. BMC Bioinform 2012;13:8.

2. Team G. Getting started with GATK4. https://gatk.broadinstitute.org/hc/en-us/articles/360036194592-Getting-started-with-GATK4 (12 January 2020, date last accessed).

3. DePristo MA, Banks E, Poplin R, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 2011;43:491–8.

4. Mashl RJ, Scott AD, Huang K-L, et al. GenomeVIP: a cloud platform for genomic variant discovery and interpretation. Genome Res 2017;27:1450–9.

5. Yakneen S, Waszak SM, Yakneen S, et al. Butler enables rapid cloud-based analysis of thousands of human genomes. Nat Biotechnol 2020;38:288–92.

6. Zhao S, Prenger K, Smith L, et al. Rainbow: a tool for large-scale whole-genome sequencing data analysis using cloud computing. BMC Genomics 2013;14:425.

7. Elshazly H, Souilmi Y, Tonellato PJ, et al. MC-GenomeKey: a multicloud system for the detection and annotation of genomic variants. BMC Bioinform 2017;18:49.

8. Andrews S. Babraham bioinformatics - FastQC a quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ (8 January 2019, date last accessed).

9. Schroeder CM, Hilke FJ, Löffler MW, et al. A comprehensive quality control workflow for paired tumor-normal NGS experiments. Bioinformatics 2017;33:1721–2.

10. Ewels P, Magnusson M, Lundin S, et al. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics 2016;32:3047–8.

11. Gordon A, Hannon GJ. Fastx-toolkit. 2010.

12. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.J 2011;17:10–2.

13. Jiang H, Lei R, Ding S-W, et al. Skewer: a fast and accurate adapter trimmer for next-generation sequencing paired-end reads. BMC Bioinform 2014;15:182.

14. Chen Y, Chen Y, Shi C, et al. SOAPnuke: a MapReduce acceleration-supported software for integrated quality control and preprocessing of high-throughput sequencing data. GigaScience 2017;7:1–6.

15. Krueger F. Trim galore! http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/ (1 March 2019, date last accessed).

16. Li YL, Weng JC, Hsiao CC, et al. PEAT: an intelligent and efficient paired-end sequencing adapter trimming algorithm. BMC Bioinform 2015;16:2.

17. Ewing B, Hillier L, Wendl MC, et al. Base-calling of automated sequencer traces using Phred. I. Accuracy assessment. Genome Res 1998;8:175–85.

18. Schmieder R, Edwards R. Quality control and preprocessing of metagenomic datasets. Bioinformatics 2011;27:863–4.

19. Patel RK, Jain M. NGS QC toolkit: a toolkit for quality control of next generation sequencing data. PLoS One 2012;7:e30619.

20. Zhou Y, Chen Y, Chen S, et al. Fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 2018;34:i884–90.

21. Chen S, Huang T, Zhou Y, et al. AfterQC: automatic filtering, trimming, error removing and quality control for fastq data. BMC Bioinform 2017;18:80–0.

22. Fonseca NA, Rung J, Brazma A, et al. Tools for mapping high-throughput sequencing data. Bioinformatics 2012;28:3169–77.

23. Li H, Handsaker B, Wysoker A, et al. The sequence alignment/map format and SAMtools. Bioinformatics 2009;25:2078–9.

24. Langmead B, Salzberg SL. Fast gapped-read alignment with bowtie 2. Nat Methods 2012;9:357.

25. Langmead B, Trapnell C, Pop M, et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009;10:25.

26. Mose LE, Wilkerson MD, Hayes DN, et al. ABRA: improved coding indel detection via assembly-based realignment. Bioinformatics 2014;30:2813–5.

27. Homer N, Nelson SF. Improved variant discovery through local re-alignment of short-read next-generation sequencing data using SRMA. Genome Biol 2010;11:R99.

28. Shlee. Changing workflows around calling SNPs and indels. https://software.broadinstitute.org/gatk/blog?id=7847 (7 May 2019, date last accessed).

29. Lai Z, Markovets A, Ahdesmaki M, et al. VarDict: a novel and versatile variant caller for next-generation sequencing in cancer research. Nucleic Acids Res 2016;44:e108–8.

30. Li H. Improving SNP discovery by base alignment quality. Bioinformatics 2011;27:1157–8.

31. Guo Y, Ye F, Sheng Q, et al. Three-stage quality control strategies for DNA re-sequencing data. Brief Bioinform 2014;15:879–89.

32. Rabbani B, Tekin M, Mahdieh N. The promise of whole-exome sequencing in medical genetics. J Hum Genet 2014;59:5–15.

33. Clark MJ, Chen R, Lam HYK, et al. Performance comparison of exome DNA sequencing technologies. Nat Biotechnol 2011;29:908–14.

34. Shiquan. Bamdst: a BAM depth stat tool. https://github.com/shiquan/bamdst (13 February 2020, date last accessed).

35. Park WY, Lee S, Ouellette S, et al. NGSCheckMate: software for validating sample identity in next-generation sequencing studies within and across data types. Nucleic Acids Res 2017;45:e103–3.

36. Wang PPS, Parker WT, Branford S, et al. BAM-matcher: a tool for rapid NGS sample matching. Bioinformatics 2016;32:2699–701.

37. Pedersen BS, Bhetariya PJ, Marth G, et al. Somalier: rapid relatedness estimation for cancer and germline studies using efficient genome sketches. bioRxiv 2019;839944.

38. Pedersen BS, Quinlan AR. Who's who? Detecting and resolving sample anomalies in human DNA sequencing studies with Peddy. Am J Hum Genet 2017;100:406–13.

39. Fasterius E, Al-Khalili SC. seqCAT: a bioconductor R-package for variant analysis of high throughput sequencing data. F1000 Res 2019;7:1466.

40. Schröder J, Corbin V, Papenfuss AT. HYSYS: have you swapped your samples? Bioinformatics 2016;33:596–8.

41. Conesa A, García-Alcalde F, Dopazo J, et al. Qualimap: evaluating next-generation sequencing alignment data. Bioinformatics 2012;28:2678–9.

42. Koboldt DC, Zhang Q, Larson DE, et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res 2012;22:568–76.

43. Kim S, Scheffler K, Halpern AL, et al. Strelka2: fast and accurate variant calling for clinical sequencing applications. Nat Methods 2018;15:591–4.

44. Ye K, Schulz MH, Long Q, et al. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 2009;25:2865–71.

45. Larson DE, Harris CC, Chen K, et al. SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics 2011;28:311–7.

46. Mayrhofer M, DiLorenzo S, Isaksson A. Patchwork: allele-specific copy number analysis of whole-genome sequenced tumor tissue. Genome Biol 2013;14:R24.

47. Talevich E, Shain AH, Botton T, et al. CNVkit: genome-wide copy number detection and visualization from targeted DNA sequencing. PLoS Comput Biol 2016;12:e1004873–3.

48. Abyzov A, Urban AE, Snyder M, et al. CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res 2011;21:974–84.

49. Yang L, Luquette LJ, Gehlenborg N, et al. Diverse mechanisms of somatic structural variations in human cancer genomes. Cell 2013;153:919–29.

50. Rausch T, Zichner T, Schlattl A, et al. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics 2012;28:i333–9.

51. Sandmann S, de Graaf AO, Karimi M, et al. Evaluating variant calling tools for non-matched next-generation sequencing data. Sci Rep 2017;7:43169.

52. Krøigård AB, Thomassen M, Lænkholm A-V, et al. Evaluation of nine somatic variant callers for detection of somatic mutations in exome and targeted deep sequencing data. PLoS One 2016;11:e0151664.

53. Xu H, DiCarlo J, Satya RV, et al. Comparison of somatic mutation calling methods in amplicon and whole exome sequence data. BMC Genomics 2014;15:244.

54. Weinhold N, Jacobsen A, Schultz N, et al. Genome-wide analysis of noncoding regulatory mutations in cancer. Nat Genet 2014;46:1160–5.

55. Rashid M, Robles-Espinoza CD, Rust AG, et al. Cake: a bioinformatics pipeline for the integrated analysis of somatic variants in cancer genomes. Bioinformatics 2013;29:2208–10.

56. Rpoplin. Variant quality score recalibration (VQSR). https://gatkforums.broadinstitute.org/gatk/discussion/39/variant-quality-score-recalibration-vqsr (29 March 2020, date last accessed).

57. Wang J, Raskin L, Samuels DC, et al. Genome measures used for quality control are dependent on gene function and ancestry. Bioinformatics 2015;31:318–23.

58. Alexandrov LB, Nik-Zainal S, Wedge DC, et al. Signatures of mutational processes in human cancer. Nature 2013;500:415.

59. Cingolani P, Platts A, Wang LL, et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin) 2012;6:80–92.

60. Guo Y, Zhao S, Sheng Q, et al. Multi-perspective quality control of Illumina exome sequencing data using QC3. Genomics 2014;103:323–8.

61. 1000 Genomes Project Consortium, Auton A, Abecasis GR, et al. A global reference for human genetic variation. Nature 2015;526:68–74.

62. Karczewski K. The genome aggregation database (gnomAD). http://gnomad.broadinstitute.org/ (29 March 2020, date last accessed).

63. Auer PL, Johnsen JM, Johnson AD, et al. Imputation of exome sequence variants into population-based samples and blood-cell-trait-associated loci in African Americans: NHLBI GO exome sequencing project. Am J Hum Genet 2012;91:794–808.

64. Karczewski KJ, Weisburd B, Thomas B, et al. The ExAC browser: displaying reference data information from over 60 000 exomes. Nucleic Acids Res 2017;45:D840–5.

65. Landrum MJ, Lee JM, Riley GR, et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res 2014;42:D980–5.

66. Stenson PD, Mort M, Ball EV, et al. The human gene mutation database: building a comprehensive mutation repository for clinical and molecular genetics, diagnostic testing and personalized genomic medicine. Hum Genet 2014;133:1–9.

67. Hamosh A, Scott AF, Amberger JS, et al. Online mendelian inheritance in man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res 2005;33:D514–7.

68. González-Pérez A, López-Bigas N. Improving the assessment of the outcome of nonsynonymous SNVs with a consensus deleteriousness score, Condel. Am J Hum Genet 2011;88:440–9.

69. Ng PC, Henikoff S. SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res 2003;31:3812–4.

70. Davydov EV, Goode DL, Sirota M, et al. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput Biol 2010;6:e1001025.

71. Rentzsch P, Witten D, Cooper GM, et al. CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res 2018;47:D886–94.

72. Ioannidis N, Rothstein J, Pejaver V, et al. REVEL: an ensemble method for predicting the pathogenicity of rare missense variants. Am J Hum Genet 2016;99:877–85.

73. Glusman G, Caballero J, Mauldin DE, et al. Kaviar: an accessible system for testing SNV novelty. Bioinformatics 2011;27:3216–7.

74. Iglesias AI, van der Lee SJ, Bonnemaijer PWM, et al. Haplotype reference consortium panel: practical implications of imputations with large reference panels. Hum Mutat 2017;38:1025–32.

75. Quang D, Chen Y, Xie X. DANN: a deep learning approach for annotating the pathogenicity of genetic variants. Bioinformatics 2015;31:761–3.

76. Shihab HA, Gough J, Mort M, et al. Ranking non-synonymous single nucleotide polymorphisms based on disease concepts. Hum Genomics 2014;8:11.

77. Ionita-Laza I, McCallum K, Xu B, et al. A spectral approach integrating functional genomic annotations for coding and noncoding variants. Nat Genet 2016;48:214–20.

78. Ritchie GRS, Dunham I, Zeggini E, et al. Functional annotation of noncoding sequence variants. Nat Methods 2014;11:294–6.

79. Liu X, Jian X, Boerwinkle E. dbNSFP: a lightweight database of human nonsynonymous SNPs and their functional predictions. Hum Mutat 2011;32:894–9.

80. Jian X, Boerwinkle E, Liu X. In silico prediction of splice-altering single nucleotide variants in the human genome. Nucleic Acids Res 2014;42:13534–44.

81. Hsiao Y-HE, Bahn JH, Lin X, et al. Alternative splicing modulated by genetic variants demonstrates accelerated evolution regulated by highly conserved proteins. Genome Res 2016;26:440–50.

82. Forbes SA, Bhamra G, Bamford S, et al. The catalogue of somatic mutations in cancer (COSMIC). Curr Protoc Hum Genet 2008;57:10.11.11–26.

83. Sherry ST, Ward M-H, Kholodov M, et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 2001;29:308–11.

84. Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res 2010;38:e164–4.

85. McLaren W, Gil L, Hunt SE, et al. The ensembl variant effect predictor. Genome Biol 2016;17:122.

86. Ramos AH, Lichtenstein L, Gupta M, et al. Oncotator: cancer variant annotation tool. Hum Mutat 2015;36:E2423–9.

87. Documentation NG. GDC MAF format v1.0.0. https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/ (1 April 2020, date last accessed).

88. Ruden D, Cingolani P, Patel V, et al. Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift. Front Genet 2012;3:35.

89. DnaMiner. ClinEff. http://www.dnaminer.com/ (1 April 2020, date last accessed).

90. McCarthy DJ, Humburg P, Kanapin A, et al. Choice of transcripts and software has a large effect on variant annotation. Genome Med 2014;6:26.

91. Kettering MS. vcf2maf. https://github.com/mskcc/vcf2maf (1 April 2020, date last accessed).

92. Lawrence MS, Stojanov P, Polak P, et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 2013;499:214–8.

93. Niu B, Scott AD, Sengupta S, et al. Protein-structure-guided discovery of functional mutations across 19 cancer types. Nat Genet 2016;48:827–37.

94. Lab LDs. MuSiC: mutational significance in cancer (cancer mutation analysis) version 2. https://github.com/ding-lab/MuSiC2 (14 July 2019, date last accessed).

95. Stalker J, Gibbins B, Meidl P, et al. The ensembl web site: mechanics of a genome browser. Genome Res 2004;14:951–5.

96. Karolchik D, Hinrichs AS, Kent WJ. The UCSC genome browser. Curr Protoc Bioinformatics 2009;Chapter 1:Unit1.4–1.4.

97. Loveland J. VEGA, the genome browser with a difference. Brief Bioinform 2005;6:189–93.

98. Skinner ME, Uzilov AV, Stein LD, et al. JBrowse: a next-generation genome browser. Genome Res 2009;19:1630–8.

99. Rutherford K, Parkhill J, Crook J, et al. Artemis: sequence visualization and annotation. Bioinformatics 2000;16:944–5.

100. Robinson JT, Thorvaldsdóttir H, Winckler W, et al. Integrative genomics viewer. Nat Biotechnol 2011;29:24–6.

101. Fiume M, Williams V, Brook A, et al. Savant: genome browser for high-throughput sequencing data. Bioinformatics 2010;26:1938–44.

102. Landry LG, Ali N, Williams DR, et al. Lack of diversity in genomic databases is a barrier to translating precision medicine research into practice. Health Aff 2018;37:780–5.