Abstract

Next-generation sequencing (NGS) technology has revolutionised human cancer research, particularly through the detection of genomic variants, owing to its ultra-high throughput and increasing affordability. However, the flood of rich cancer genomics data has created significant challenges in exploring it and translating it into biological insights. One of the difficulties in cancer genome sequencing is software selection. Currently, multiple tools are widely used to process NGS data in four stages: (i) raw sequence data pre-processing and quality control (QC), (ii) sequence alignment, (iii) variant calling and (iv) annotation and visualisation. However, the differences between these NGS tools, including their installation, merits, drawbacks and application, have not been fully appreciated. Therefore, a systematic review of the functionality and performance of NGS tools is required to provide cancer researchers with guidance on software and strategy selection. Another challenge is the multidimensional QC of sequencing data: QC not only reports varied sequence data characteristics but also reveals deviations in diverse features, and it is essential for a meaningful and successful study. However, monitoring of QC metrics in specific steps, including alignment and variant calling, is neglected in certain pipelines such as the ‘Best Practices Workflows’ in GATK. In this review, we investigated the most widely used software for the fundamental analysis and QC of cancer genome sequencing data and provided instructions for selecting the most appropriate software and pipelines to ensure precise and efficient conclusions. We further discussed the prospects and new research directions for cancer genomics.

Introduction

Next-generation sequencing (NGS) technology has produced a wealth of cancer genomics data, leading to the need for efficient and accurate software and workflows/pipelines for its processing and interpretation. For example, Atlas2 is an integrative workflow specialised for variant calling on exome capture sequencing data from three platforms (SOLiD, Illumina and Roche 454) [1]. GATK4 is a robust variant analysis pipeline and toolkit available for many types of sequencing platforms [2]. The unified analytic framework presented by DePristo et al. allows the discovery and genotyping of variants among multiple samples simultaneously, across five sequencing technologies and three experimental designs [3]. Meanwhile, GenomeVIP is a framework for performing variant discovery and annotation from whole-genome sequencing (WGS) and whole-exome sequencing (WES) data [4]. Furthermore, cloud-based workflow systems such as Butler, Rainbow and MC-GenomeKey provide solutions for facilitating population-scale human genomic studies [5–7]. However, detailed analyses of these workflow systems are not available to users, and they either focus solely on variant discovery or lack the necessary quality control (QC) measures.

Cancer genome analysis comprises raw data pre-processing and QC, sequence alignment and variant calling, followed by variant annotation and visualisation. Although some of these software and pipelines are prevalent and widely used in a few large-scale studies, three issues remain: (i) there is a lack of professional guidance for selecting between suitable software with similar functions; (ii) certain well-known processes are not suitable for particular applications; and (iii) multi-dimensional QC is ignored.

In this review, we discussed how to select suitable software and workflow strategy as well as to conduct quality management according to the four steps detailed in Figure 1: (i) raw sequence data pre-processing and QC, (ii) sequence alignment and QC, (iii) variant detection and filtering and (iv) variant annotation and visualisation. Most research on cancer genome sequencing focuses on only one aspect of the analysis or lacks a detailed description of the quality control. This review will fill in the gaps by providing comprehensive and detailed recommendations for cancer genome sequencing analysis.

Figure 1

Detailed overview of the NGS data analysis pipeline with quality control. Analysis of cancer genome sequencing data from WGS, WES or targeted sequencing consists of four steps: (i) raw data processing (shown in blue); (ii) alignment data processing (shown in green); (iii) short variant (SNV/indel) calling and false-positive filtering (shown in yellow); and (iv) variant annotation and visualisation (shown in orange). Quality control (QC) for each step is listed on the right in a darker colour. Red font is used to emphasise the importance of QC reports. There are two alternative routes after unique reads have been generated. The steps denoted by orange dashed lines indicate that realignment is required when the variant caller has no realignment function. Realignment should not be conducted when choosing a variant caller with a built-in realignment function, such as VarDict or GATK version 3.7 or higher. Sample matching checks (orange arrows) can be conducted using NGSCheckMate in ‘Raw data processing’, with BAM-Matcher or NGSCheckMate in ‘Aligned data processing’, and with NGSCheckMate in ‘Variant data processing’. However, this step is only required once, and we recommend that it is conducted during ‘Aligned data processing’.

Raw sequence data pre-processing and QC

The QC and pre-processing of raw sequence data in the FASTQ format are necessary to obtain high-confidence results. QC generates text or graphic reports that identify issues deriving from either the sequencer or the library material, whereas pre-processing trims sequence artefacts (typically platform-specific PCR primers and adapters) and filters out substandard reads to generate clean sequences for subsequent analysis.

Several tools are publicly available for the manipulation of FASTQ files. We have summarised the functions and features of 15 of these open-source tools and have listed the number of citations (as of 9 January 2020; normalised by years since publication) to indicate their popularity in Table 1. Also, we classified these sophisticated tools into three main categories, namely, QC tools, pre-processing tools and all-in-one tools, based on the differences in their methods and design functionalities that facilitate their use by those who lack relevant analytical experience. To further distinguish these tools, we systematically explore and compare the results obtained from three WES cases, therefore allowing us to make recommendations on tool selection and prioritisation according to particular needs.

Table 1

Raw data analysis tools

Tools are grouped by function: (i) quality control tools, (ii) pre-process tools and (iii) all-in-one tools.

| Software | Citations/year | Installation | Time (hh:mm:ss) | Report format | Visualisation |
|---|---|---|---|---|---|
| FastQC | 384.20 | B/F | 00:03:11 | html; text | Before |
| MultiQC | 124.25 | B/C/P | 00:00:06 | html; text; json | Post |
| NGS-Bits (ReadQC) | 0.67 | B/C/F | 00:09:17 | qcML, tsv | Post |
| FASTX-Toolkit | 22.30 | B/F/C | NA | text | Post |
| Cutadapt | 737.56 | B/P | 00:10:29 | stdout | χ |
| TrimGalore | 37.13 | B/F | 00:05:56 | text | χ |
| Trimmomatic | 2148.50 | B/F | 01:03:22 | text | χ |
| Skewer | 64.50 | B/C/F | 00:22:55 | stdout | χ |
| PEAT | 4.20 | C/F | 00:59:35 | text | χ |
| SOAPnuke | 26.67 | C | 00:07:08 | text | χ |
| PRINSEQ | 270.00 | B/F | 08:23:20 | html | Post |
| NGS QC Toolkit | 178.13 | C | 01:04:09 | html; text | Post |
| FaQCs | 11.83 | B/C | 01:09:17 | pdf; text | Post |
| AfterQC | 16.00 | B/C/F | 04:02:50 | html; json | Both |
| Fastp | 86.00 | B/C/F | 00:07:03 | html; pdf; json | Both |

Report feature, Interactive: χχχχχχχχχχχχ
Development feature, Multi-threading: χχχχ*
Development feature, Multi-sample: χχχχχχχχχχχ
Quality profile, Duplicated reads: χχχχχχχχχχ
Quality profile, Overlapping analysis: χχχχχχχχχχχχχ
Quality profile, Sequencing error profiling: χχχχχχχχχχχχχ
Processing function, PE or SE: SE χ SE
Processing function, Detect adapter automatically: χχχχχχχχχχχ
Processing function, Cut adapter: χχχ
Processing function, Convert quality: χχχχχ
Processing function, Quality filtering: χχχχ
Processing function, Trimming 3′ or 5′: χχχ
Processing function, Sequence complexity: χχχχχχχχχχχχ
Processing function, UMI: χχχχχχχχχχχχχχ
Processing function, Compressed file: ✓= ✓≈ χ ✓=

✓, software has this function; χ, software does not have this function; *, software has this function, but no parameters were provided; #, SE; =, only input; ≈, only output.

B, install with BioConda or run Docker containers built by BioContainers; C, install by compiling from source; D, Docker container independent of BioContainers; F, no compilation needed because a pre-compiled binary file or Perl script is provided; P, install with pip.

‘Interactive’ row indicates reports that include interactive figures that can respond to user events like mouse hover.

‘Time’ was the average running time of processing 3 WES samples of Illumina paired-end sequencing.


QC tools

Dedicated QC tools such as FastQC and ReadQC (a module in NGS-Bits) are excellent for highlighting potential problems in sequence data [8, 9]. FastQC illustrates QC metrics exhaustively and efficiently by producing graphs and tables in an HTML-based permanent report which includes read and base counts, read length, base composition distribution, GC content distribution and quality score curve statistics. In addition, FastQC checks the quality score encoding scheme and provides warnings and error thresholds that can be defined by the user. Conversely, the ReadQC report is relatively simple, consisting of read and base counts, read length, Q20, Q30 and two plots in qcML format. FastQC was faster than ReadQC in the testing using WES data, owing to the down-sampling strategy in FastQC.
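As a rough illustration of the simplest metrics mentioned above (read count, base count, Q20 and Q30), the following sketch computes them from FASTQ quality strings. It assumes Phred33 encoding and is not how FastQC or ReadQC are implemented; the function name `quality_summary` and the toy input are hypothetical.

```python
def quality_summary(quality_strings, offset=33):
    """Summarise read count, base count and the fraction of bases >= Q20 / Q30.

    `offset` is the ASCII offset of the quality encoding (33 assumed here).
    """
    reads = bases = q20 = q30 = 0
    for qual in quality_strings:
        reads += 1
        for ch in qual:
            score = ord(ch) - offset  # decode one Phred score
            bases += 1
            if score >= 20:
                q20 += 1
            if score >= 30:
                q30 += 1
    if bases == 0:
        return {"reads": reads, "bases": 0, "q20": 0.0, "q30": 0.0}
    return {"reads": reads, "bases": bases, "q20": q20 / bases, "q30": q30 / bases}


# Toy quality strings (hypothetical): 'I' = Q40, '5' = Q20, '+' = Q10, '#' = Q2.
summary = quality_summary(["IIII5", "I5+#"])
print(summary)
```

In practice the quality strings would be taken from every fourth line of a FASTQ file.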

Identifying sequencing quality issues by checking the reports of each sample in a cohort study can be inconvenient. MultiQC solves this problem by creating a single report that flexibly integrates quality reports from various QC tools and sample sets, allowing overall trends and deviations in large-scale sample studies to be identified rapidly and efficiently [10]. Therefore, we recommend utilising MultiQC in cohort studies to obtain insights into the quality of all samples and to detect batch effects.

Pre-processing tools

Pre-processing tools manipulate FASTQ files by trimming adapters, filtering low-quality sequences, correcting base errors, etc., to prepare the files for alignment. The most significant of these features are adapter detection and trimming because adapter sequences will be read when the DNA fragment is shorter than the specified sequencing length, resulting in reads with unfavourable adapter sequences, known as adapter contamination. FASTX-Toolkit, Cutadapt, TrimGalore, Trimmomatic, Skewer and SOAPnuke are all common examples of pre-processing software designed to recognise and remove adapters based on sequence-matching algorithms that require a series of adapter sequences [11–15]. Conversely, paired-end adapter trimmer (PEAT) detects adapter sequences with indefinite length and a higher detection accuracy by using an overlap analysis algorithm that exploits overlapping fragments of paired forward and reverse reads [16]. Compared with sequence matching-based algorithms, overlap analysis-based algorithms have the distinct advantage of being able to detect and remove shorter adapters. Because the overlap analysis-based method does not require prior adapter sequences, it is particularly useful for analysing large-scale cancer genomics data from multiple different technical platforms that may use different adapters.
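The idea behind overlap analysis can be sketched as follows. When the DNA fragment is shorter than the read length, the first L bases of the forward read and the reverse complement of the first L bases of the reverse read both cover the whole fragment, so everything after position L is adapter. This is a minimal sketch under simplifying assumptions (equal-length reads, substitutions only, a hypothetical mismatch tolerance) and is not the algorithm implemented in PEAT or any other tool; all names are illustrative.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(1 for x, y in zip(a, b) if x != y)


def find_insert_length(read1, read2, min_overlap=10, max_mismatch_rate=0.1):
    """Return the longest fragment length shorter than the reads at which
    both reads agree, or None if no such overlap is found."""
    n = min(len(read1), len(read2))
    for length in range(n - 1, min_overlap - 1, -1):
        candidate = reverse_complement(read2[:length])
        if hamming(read1[:length], candidate) <= max_mismatch_rate * length:
            return length
    return None


def trim_pair(read1, read2, **kwargs):
    """Trim adapters from a read pair; returns (read1, read2, adapter1, adapter2)."""
    length = find_insert_length(read1, read2, **kwargs)
    if length is None:
        return read1, read2, "", ""
    return read1[:length], read2[:length], read1[length:], read2[length:]


# Toy example with a 20-base fragment and hypothetical adapter prefixes.
fragment = "ACGTTGCAAGGCTTACGATC"
r1 = fragment + "AGATC"
r2 = reverse_complement(fragment) + "CTGTC"
trimmed = trim_pair(r1, r2)
print(trimmed)
```

Note that no adapter sequence is supplied as input, which is the property the text highlights for data from platforms with different or unknown adapters.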

Low-quality sequences should also be discarded from FASTQ files to ensure reliable interpretation; therefore, careful attention should be paid to the quality encoding scheme. For instance, Phred33 encodes each quality score as the ASCII character whose code is the score plus 33 (ASCII 33–126) and is used by Sanger and Illumina 1.8+ sequencers, whereas Phred64 adds 64 to the score (ASCII 59–126 when the older Solexa scale is included) and was used before the Illumina pipeline CASAVA 1.8 [17]. FastQC automatically detects and reports a sequencer version, which can be used to infer the ASCII quality encoding [8]. Most raw sequence data pre-processing tools support both quality encoding schemes and can convert from Phred64 to Phred33, except for the FASTX-Toolkit, which can only process Phred64 FASTQ files. In addition, Trimmomatic can convert from Phred33 to Phred64 and can automatically check the quality format if the encoding scheme has not been specified by the user. To provide further insight, we tested these tools with three WES samples of Illumina paired-end sequencing and compared numerous additional features of these pre-processing tools, including installation, report format and support for paired-end reads (Table 1). This summary table helps practitioners compare the tools more easily and select the one most suitable for a given application; for instance, SOAPnuke can be scaled to multiple working nodes for parallel computing using a Hadoop MapReduce framework to cope with a vast volume of sequencing data [14]. Further, if researchers intend to view the quality trend of multiple samples in a large-scale study, they can quickly locate ‘MultiQC’ through the ‘Multi-sample’ row in ‘Development feature’; if data originate from different sequencers or the adapters of sequencing batches are unknown, they can readily find ‘PEAT’, ‘NGS QC Toolkit’, ‘AfterQC’ and ‘Fastp’, which detect adapters automatically, through the ‘Detect adapter automatically’ row in ‘Processing function’.
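The encoding check and the Phred64-to-Phred33 conversion described above can be illustrated with a short sketch. The heuristic is an assumption made for illustration (any character with an ASCII code below 59 can only occur in Phred33; otherwise Phred64 is guessed) and is not the detection logic of FastQC or Trimmomatic; the function names are hypothetical.

```python
def guess_phred_offset(quality_strings):
    """Guess the quality offset (33 or 64) from the lowest character seen.

    Heuristic only: codes below 59 are impossible in Phred64, so their
    presence implies Phred33; otherwise Phred64 is assumed.
    """
    lowest = min((ord(ch) for qual in quality_strings for ch in qual), default=None)
    if lowest is None:
        raise ValueError("no quality characters supplied")
    return 33 if lowest < 59 else 64


def decode_quality(qual, offset):
    """Convert a quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in qual]


def phred64_to_phred33(qual):
    """Re-encode a Phred64 quality string as Phred33 (shift by 64 - 33 = 31)."""
    return "".join(chr(ord(ch) - 31) for ch in qual)


print(guess_phred_offset(["IIII#"]))   # contains '#' (ASCII 35)
print(guess_phred_offset(["hhhh^"]))   # all characters at or above ASCII 64
print(phred64_to_phred33("hh"))
```

A file containing only high-quality Phred33 reads can be misclassified by this heuristic, which is one reason to confirm the encoding from the sequencer version where possible.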

All-in-one tools

Neither QC tools nor pre-processing tools alone can address and process all the issues in raw sequence data simultaneously; therefore, the conventional approach is to use FastQC for QC, Cutadapt to remove adapters and Trimmomatic to filter reads in a ‘QC, pre-process, QC (optional)’ workflow. TrimGalore integrates both pre-processing and QC functions by internally using Cutadapt to perform adapter trimming and quality filtering and then running FastQC on the resulting files to provide a quality report [15]. Notably, TrimGalore runs faster than Cutadapt on paired-end data because it processes each read file individually in single-end mode and then performs a ‘validation’ run on both trimmed files. All-in-one tools go further by performing QC, adapter trimming, quality filtering and many other processes in a single scan of the FASTQ file, which significantly simplifies raw data processing.

PRINSEQ is the first all-in-one processing tool available as a command-line program or web interface [18]. However, its lack of support for multithreading and compression formats means that it is very time-consuming when completing statistical reports for large datasets, which dramatically limits its use in cancer genomics research. While the NGS QC Toolkit supports both Illumina and Roche 454 platforms and generates an HTML-based report after processing, it runs slowly because its parallel mode uses one thread for each separate input file [19]. Fastp and AfterQC both produce interactive reports before and after pre-processing, enabling users to hover over points or zoom in on figures for direct scrutiny. Moreover, both automatically detect and remove adapters using an overlap analysis algorithm [20, 21]. Additionally, both Fastp and AfterQC evaluate sequencing errors in paired-end data and correct possible wrong bases, which is essential for low-frequency variant calling. Unfortunately, AfterQC runs relatively slowly; hence, it is overly time-consuming when processing large FASTQ files, such as high-depth sequence data [20]. Conversely, Fastp has comprehensive functions, can carry out effective manipulations simultaneously and is thus suitable for automated raw data processing in the routine analysis of cancer genomics data.

In summary, although QC-only tools are highly efficient for reporting quality metrics, all-in-one tools outperform separate QC tools, pre-processing tools or combinations of the two by integrating functionalities to address more problems without sacrificing efficiency. As shown in Figure 1, we therefore recommend FastQC for producing single-sample quality reports, MultiQC for producing the overall statistics of a set of samples and all-in-one tools for the QC and pre-processing of raw sequence data owing to their efficient operation and functionalities.

Sequence alignment and QC

A fundamental part of cancer genome sequencing is the alignment or mapping of short DNA reads to a human reference genome. The main difficulty of this process is mapping each short read to a large reference genome while distinguishing between true variants and sequencing errors. To overcome this challenge and handle growing quantities of short DNA sequence reads, many alignment tools have been proposed since 2008. Fonseca et al. maintained an up-to-date compendium of over 80 aligners, of which at least 40 can be used to align short DNA sequences [22].

Cancer genome sequencing is primarily performed with paired-end short reads. When selecting a mapper, it is essential to consider the input data characteristics for which the mapper was designed. For instance, mappers (such as Bowtie2 and BWA-MEM) that exploit sequencing quality scores can reduce alignment errors by assigning weaker penalties to mismatches at positions with low quality scores, and support for paired-end reads helps to detect alignment errors as well as improve sensitivity and specificity [22].

The different output characteristics of each mapper should also be taken into consideration because mapping quality scores and their thresholds vary between mappers. BWA-MEM assigns each read a mapping quality score between 0 and 60 to reflect the accuracy of its mapping; a higher score represents a more accurate mapping result [23]. In contrast, Bowtie2 reports mapping qualities in the range 0–42, and Bowtie reports 255 for aligned reads by default; these scores relate to ‘mapping uniqueness’, with a higher mapping quality assigned to more unique alignments [24, 25].
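Where a mapper reports a Phred-scaled mapping quality, as the SAM format defines it, a score Q corresponds to a probability of 10^(-Q/10) that the reported position is wrong, and the value 255 indicates that no mapping quality is available. The sketch below converts between the two and applies a threshold; the function names and the threshold of 20 are illustrative assumptions, and an appropriate cut-off depends on the mapper used.

```python
def mapq_to_error_probability(mapq):
    """Convert a Phred-scaled mapping quality to the probability that the
    mapping position is wrong. Returns None for 255 (quality unavailable)."""
    if mapq == 255:
        return None
    return 10 ** (-mapq / 10)


def passes_mapq(mapq, threshold=20):
    """Keep a read only if its mapping quality is available and at or above
    the (hypothetical) threshold."""
    return mapq != 255 and mapq >= threshold


for q in (0, 20, 60, 255):
    print(q, mapq_to_error_probability(q), passes_mapq(q))
```

Because mappers scale their scores differently, the same numeric threshold does not imply the same stringency across tools.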

After the original alignment of short reads, several recalibration steps are carried out, including the sorting of reads by alignment position, the marking or removal of duplicated reads, the remapping of reads around short insertions and deletions (indels) and the recalibration of read quality. These steps are designed to make the alignment more accurate and prepare for the variant calling step; however, if misused, more false positives will be introduced.
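The duplicate-marking step can be sketched as grouping reads that share the same chromosome, start position and strand, and flagging all but one read in each group. This is a simplified single-end illustration, assuming the read with the highest summed base quality is retained; production tools also consider mate positions and clipping, and the record fields used here are hypothetical.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates.

    Each read is a dict with keys: name, chrom, pos, strand, qual_sum.
    Within each (chrom, pos, strand) group, the read with the highest
    summed base quality is kept and the others are flagged.
    """
    groups = defaultdict(list)
    for read in reads:
        groups[(read["chrom"], read["pos"], read["strand"])].append(read)

    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda r: r["qual_sum"])
        duplicates.update(r["name"] for r in group if r is not best)
    return duplicates


# Toy alignments (hypothetical values).
reads = [
    {"name": "r1", "chrom": "chr1", "pos": 100, "strand": "+", "qual_sum": 300},
    {"name": "r2", "chrom": "chr1", "pos": 100, "strand": "+", "qual_sum": 350},
    {"name": "r3", "chrom": "chr1", "pos": 100, "strand": "-", "qual_sum": 200},
    {"name": "r4", "chrom": "chr1", "pos": 250, "strand": "+", "qual_sum": 100},
]
flagged = mark_duplicates(reads)
print(flagged)
```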

A notable example of a recalibration step is local realignment. Indels are common and functionally essential types of sequence polymorphism. Unlike single-nucleotide polymorphisms (SNPs), indels make sequence mapping more complicated, resulting in alignment artefacts that may lead to a high false-positive rate when detecting single-nucleotide variations (SNVs) around indels. Performing local realignment before variant detection significantly improves read alignments, reduces the false-positive rate, provides enhanced performance for the detection of indels and results in higher accuracy in the calculation of variant allele frequency (VAF) [26, 27]. DePristo et al. benchmarked the effects of indel realignment and described the mathematics behind the algorithms [3].

To reduce the false-positive rate in SNV detection and to improve the accuracy of indel detection, local realignment of indel regions is commonly performed using tools such as GATK, SRMA and ABRA (Figure 1A) [26–28]. An alternative strategy is the new variant caller VarDict, which internally triggers local realignment in indel regions and simultaneously identifies mismatches as evidence of indels, thereby improving the accuracy of indel detection [29]. In addition, the command ‘mpileup’ in SAMtools utilises a per-base alignment quality (BAQ) score, which is the Phred-scaled probability of a read base being misaligned, to help reduce the number of false SNVs caused by mismatches around indels; BAQ is applied by default and can be disabled with the parameter ‘-B’ [30].

It is essential to bear in mind that performing local realignment repeatedly may cause over-correction. In a study by Guo et al., the consecutive application of BAQ and GATK’s local realignment functions did not reduce the number of false-positive SNPs but, instead, introduced additional false positives [31]. The GATK team have eliminated realignment from the Mutect2 and HaplotypeCaller pipelines for GATK 3.7 and higher and have adopted a strategy of read reassembly in active regions covering indels in their new pipeline (Figure 1B). Similarly, we do not recommend using callers with a realignment function on input files that have already been processed using a separate local realignment tool (Figure 1C).

Cancer genome sequencing mainly involves WGS, WES and gene-panel sequencing for specific genes associated with cancer risk. Thus, it is important to determine the different QC parameters for these types of sequencing data.

The main QC parameter for WGS data is average depth (also known as the sequencing coverage), which is required for the accurate and reliable calling of variants and the estimation of VAF. However, average depth is easily skewed by a short region with a significantly higher depth, which can be quickly identified from depth distribution with an extended tail if regions of extremely high depth exist. Thus, it is more meaningful to exclude extremely high-depth regions when calculating average depth.
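The effect of a short, extremely deep region on average depth can be illustrated with a small sketch. It assumes per-position depths are already available (for example from a depth report) and excludes positions whose depth exceeds a multiple of the median; the five-fold cut-off and the function name are hypothetical choices rather than an established standard.

```python
import statistics


def robust_mean_depth(depths, max_fold=5):
    """Mean depth after excluding positions deeper than max_fold x median.

    `max_fold` is an illustrative cut-off for 'extremely high depth'.
    """
    cap = max_fold * statistics.median(depths)
    kept = [d for d in depths if d <= cap]
    return sum(kept) / len(kept)


# Toy data: 99 positions at 30x and one collapsed-repeat position at 10000x.
depths = [30] * 99 + [10000]
plain_mean = sum(depths) / len(depths)
robust_mean = robust_mean_depth(depths)
print(plain_mean, robust_mean)
```

Here a single outlier position raises the plain average well above the depth of every other position, whereas the filtered average reflects the bulk of the genome.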

Because most of the known pathogenic mutations are found in coding regions, it is more practical to limit sequencing and analysis to exons or to a set of genes relevant to a particular cancer [32]. Thus, WES and gene-panel sequencing are being incorporated into clinical practice because they are easier to manage in terms of cost and complexity of subsequent analysis. For capture-based target sequencing technology, it is important to consider capture efficiency. Clark et al. compared WES hybridisation techniques from three vendors, namely, Nextera and TruSeq from Illumina Inc, SureSelect from Agilent Technologies and SeqCap EZ from Roche NimbleGen, and found at least 90% capture efficiency for all three [33]. The calculation of capture efficiency is more meaningful when ignoring bases with low sequencing/mapping quality or low depth, as these bases do not provide sufficient evidence for variant calling. Hence, Bamdst and the Picard module CollectHsMetrics estimate the efficiency after filtering out low-quality reads [34]. Low capture efficiency can indicate low library complexity, suboptimal probe hybridisation conditions or low-stringency washes.
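One simple way to express capture efficiency is the fraction of aligned bases, after a mapping quality filter, that fall inside the target regions. The sketch below follows that definition as an assumption for illustration; Bamdst and Picard CollectHsMetrics report their own, more detailed metrics, and the record layout, threshold and names here are hypothetical. Target intervals are assumed to be non-overlapping and half-open.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap between two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def capture_efficiency(alignments, targets, min_mapq=20):
    """Fraction of aligned bases (mapq >= min_mapq) that lie on target.

    alignments: iterable of (chrom, start, end, mapq)
    targets:    list of non-overlapping (chrom, start, end) intervals
    """
    total = 0
    on_target = 0
    for chrom, start, end, mapq in alignments:
        if mapq < min_mapq:
            continue  # ignore low-confidence alignments
        total += end - start
        for t_chrom, t_start, t_end in targets:
            if t_chrom == chrom:
                on_target += overlap(start, end, t_start, t_end)
    return on_target / total if total else 0.0


targets = [("chr1", 100, 200)]
alignments = [
    ("chr1", 90, 140, 60),   # 40 of 50 bases on target
    ("chr1", 150, 200, 60),  # fully on target
    ("chr1", 500, 550, 60),  # off target
    ("chr1", 100, 150, 0),   # filtered out by mapping quality
]
efficiency = capture_efficiency(alignments, targets)
print(efficiency)
```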

Sample matching is checked at the alignment stage to ensure that data from the same subject are correctly labelled. NGSCheckMate supports input files in FASTQ, BAM and variant call format (VCF) for sample matching verification. However, it is time-consuming for files in FASTQ format as the software needs to search for a k-mer sequence spanning specific SNPs to obtain read counts and calculate VAFs [35]. BAM-matcher is an alternative tool that provides a rapid pairwise comparison of BAM files by comparing the genotypes at pre-determined genomic locations [36]; however, the software does not support sample consistency examination for cohort studies, unlike NGSCheckMate. Alternatively, Somalier is designed to rapidly measure sample relatedness with BAM files by extracting informative genetic variation to a binary format called ‘sketch’ across 17 384 sites. It also supports reads aligned to genome build GRCh37 and GRCh38. However, understanding the HTML report of Somalier requires bioinformatics expertise [37]. Additional tools supporting the sample matching check include Peddy, seqCAT and HYSYS; however, they only check the matching with VCF files [38–40].
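The principle of sample matching can be sketched by comparing VAFs at a shared set of known SNP sites: two files from the same individual should show highly correlated VAFs, whereas files from unrelated individuals should not. The example uses a Pearson correlation with a hypothetical threshold of 0.8; it illustrates the idea only and does not reproduce the statistical models of NGSCheckMate, BAM-matcher or Somalier.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def same_individual(vafs_a, vafs_b, threshold=0.8):
    """Call two samples matched if their VAFs at shared SNP sites correlate
    above the (hypothetical) threshold."""
    return pearson(vafs_a, vafs_b) >= threshold


# Toy VAFs at six SNP sites (hypothetical values).
tumour = [0.0, 0.5, 1.0, 0.5, 0.0, 1.0]
normal_same = [0.02, 0.48, 0.97, 0.55, 0.0, 1.0]
normal_other = [1.0, 0.0, 0.5, 0.0, 1.0, 0.5]
print(same_individual(tumour, normal_same), same_individual(tumour, normal_other))
```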

Several tools are used to perform QC and manipulate alignment data. For instance, SAMtools provides universal tools for processing mapping data, such as sorting (sort), indexing (index), depth statistics (depth) and alignment viewing (view) [23], whereas Qualimap is a platform-independent application that examines alignment data and provides a holistic view of the alignments to help identify mapping bias [41]. Additionally, MapQC in NGS-Bits inspects mapping information and presents the results as figures and tables [9]. Picard outperforms SAMtools when sorting, marking or removing duplicates from aligned sequences and can be used as an integrated part of GATK in addition to being a stand-alone application.

In summary, the alignment process is an important step in cancer genome sequencing. Although many programs map short DNA sequencing reads to a reference genome, we recommend BWA (BWA-MEM) as it works well in conjunction with downstream analysis tools. For example, GATK integrates BWA into its pipelines and calls variants based on the mapping scores of BWA. Moreover, BWA has been adopted in many large studies, such as TCGA and the Pan-Cancer Analysis of Whole Genomes (PCAWG). We then recommend Picard to sort reads and remove duplicates. Because some callers in the variant identification step have their own local realignment function, researchers should decide whether and how to conduct local realignment before completing base quality recalibration using GATK. Qualimap can also be used to provide detailed alignment quality statistics, while sample consistency, which is essential in cancer genomics research, should be checked with NGSCheckMate in tumour-normal pairwise cohort studies or with Somalier in cancer or germline studies. All of these recommendations concerning the alignment stage are provided in Figure 1.

Variant detection and filtering

The most common application of cancer genome sequencing in clinical settings is the detection of short and structural variants with respect to a human reference genome, which has greatly stimulated the development of variant detection software [29, 42–50]. To improve the accuracy of detection, many callers also provide a corresponding filtering strategy; however, callers usually differ in their detection algorithms, filtering methods and thus output, particularly when handling sequence data with various characteristics (e.g. sequencing platform, capture regions, depth).

To date, a considerable number of studies have been published comparing the sensitivity, accuracy and performance of different variant detection software for WGS data, WES data and gene-panel sequencing data. For instance, Sandmann et al. evaluated eight tools (VarScan, GATK HaplotypeCaller, Platypus, LoFreq, SAMtools, FreeBayes, SNVer and VarDict) using both real and simulated non-matched datasets [51], whereas Krøigård et al. focused on five real tumour-normal paired samples and nine callers (EBCall, MuTect2, Seurat, Shimmer, Indelocator, Somatic Sniper, Strelka, VarScan2 and Virmid) [52]. The sensitivity of callers to sequencing depth has been investigated for both WES and targeted deep sequencing data; Xu et al. compared the performance of five callers (GATK UnifiedGenotyper, MuTect2, Strelka, SomaticSniper and VarScan2) on SNV detection using amplicon and WES data [53]. However, the versions of the tools evaluated in these studies may not be the most recent, and some callers only detect one variant signal (SNVs or indels). Further, there is a lack of ground truth for some real datasets.

Here, we limited our comparison to four continuously updated and commonly used tools that can detect both SNVs and indels in one run, namely, VarDict (version 1.5.7 in Java), Mutect2 (GATK 4.1.1), VarScan2 (version 2.4.3) and Strelka2 (version 2.9.10). We used four tumour-normal samples with ground truth and standard baseline filtering (VAF > 0.01, depth of at least 500× in both tumour and normal data) provided by the National Center for Clinical Laboratories in China (NCCL). The samples were sequenced using an Illumina HiSeq 2500 platform with a target region of 1.76 million bases and an average sequencing depth of 2700×. The test results are shown in Figure 2 (A, tumour-normal matched samples; B, tumour-only samples; C, agreement of callers) and indicate that VarDict performs best in terms of precision and recall for tumour-normal matched samples.
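The metrics reported in Figure 2 can be illustrated with a minimal Python sketch. It assumes each call set has already been reduced to a set of (chromosome, position, ref, alt) tuples; the function names, caller labels and toy variants are hypothetical and are not taken from the NCCL samples or from any caller's own tooling.

```python
from typing import Dict, Set, Tuple

# A variant is keyed by (chromosome, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


def precision_recall(calls: Set[Variant], truth: Set[Variant]) -> Tuple[float, float]:
    """Return (precision, recall) of a call set against a ground-truth set."""
    true_positives = len(calls & truth)
    precision = true_positives / len(calls) if calls else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


def agreement_matrix(callsets: Dict[str, Set[Variant]]) -> Dict[str, Dict[str, float]]:
    """For each row caller, the fraction of its calls also reported by each column caller."""
    matrix: Dict[str, Dict[str, float]] = {}
    for row_name, row_calls in callsets.items():
        matrix[row_name] = {
            col_name: (len(row_calls & col_calls) / len(row_calls) if row_calls else 0.0)
            for col_name, col_calls in callsets.items()
        }
    return matrix


# Toy data (hypothetical variants for illustration only).
truth = {("chr1", 100, "A", "T"), ("chr1", 200, "G", "C"), ("chr2", 50, "C", "CT")}
callsets = {
    "callerA": {
        ("chr1", 100, "A", "T"),
        ("chr1", 200, "G", "C"),
        ("chr2", 50, "C", "CT"),
        ("chr3", 7, "T", "G"),
    },
    "callerB": {("chr1", 100, "A", "T"), ("chr3", 7, "T", "G")},
}

if __name__ == "__main__":
    print(precision_recall(callsets["callerA"], truth))
    print(agreement_matrix(callsets))
```

Note that the agreement matrix is asymmetric by construction: the denominator is always the row caller's own call count, which matches the row-wise reading described for Figure 2C.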

Performance of variant calling tools. (A) Performance of tools (VarDict, Mutect2, VarScan2 and Strelka2) on short somatic variant calling with tumour-normal samples (gene-panel sequencing samples B1701-B17NC, B1702-B17NC, B1703-B17NC, B1704-B17NC). (B) Performance of tools (VarDict and Mutect2) on short somatic variant calling with tumour-only samples (gene-panel sequencing samples B1701, B1702, B1703, B1704). The size of the circle correlates with precision: a larger circle corresponds to higher precision of the tool. The colour of the circle represents the recall rate: a darker colour (pink in tumour-normal samples; blue in tumour-only samples) means a higher recall rate. (C) Pairwise comparisons between VarDict, Mutect2, VarScan2 and Strelka2 for the gene-panel sequencing of B1701. The matrix depicts agreement between the variant callers. The numbers in the first four columns were calculated from variants without baseline filtering (Raw), while the numbers in the last four columns were calculated from baseline filtered variants (VAF ≥ 0.01, depth ≥ 500). In each row, the number reflects the fraction of calls identified by the caller that were also reported by the other callers. For instance, 0.6554 in the first row indicates that Mutect2 reports 65.54% of the calls reported by VarDict. The colour reflects the degree of agreement: greater colour intensity indicates higher agreement between the two callers.
Figure 2

We further reviewed the published literature evaluating variant detection tools in cancer genomes and analysed the results for the four tools we examined, leading to four conclusions. Firstly, overall inter-caller agreement is low when no variant screening criteria are applied. Sandmann et al. showed that most variants could only be identified by one tool, with results supported by two callers only accounting for around 10% of the results supported by one caller, suggesting that each tool yields many false positives. The agreement is even lower for indels [51]. Secondly, the sensitivity of each caller varies with sequencing depth. SNV detection by Shimmer and SomaticSniper is greatly affected by depth; EBCall, Virmid, MuTect2 and Strelka are relatively conservative; and Seurat and Indelocator are more sensitive for indels than EBCall and Strelka. Greater depth correlates with higher consistency of each caller; thus, depth can improve the accuracy of software to some extent [52]. Thirdly, detection limits differ tremendously between callers. LoFreq, FreeBayes, VarDict and SNVer can identify variants with a VAF of ≤5%; MuTect2, 5–10%; VarScan, 15–20%; and SAMtools, >20% [52]. Notably, the consistency of each caller increases with increasing detection limits. Fourthly, among the tools we examined, only VarDict recognises complex variants, with most callers reporting multiple adjacent variants instead of a single multiple nucleotide variant (MNV). For example, a true complex ‘Inframe indel’ variant reported in ClinVar, NC_000017.10:g.37880997delinsTTAT, which occurs in an exon of the ERBB2 gene, was only called by VarDict (Figure 3). In contrast, Mutect2 and VarScan2 reported it as two individual variants (a ‘TTA’ insertion at chr17:37880996 and a ‘G > T’ SNV at chr17:37880997), and Strelka2 reported only the ‘G > T’ SNV at chr17:37880997. These results show that VarDict can directly detect and represent complex mutations, whereas calls from the other callers require further manual review.
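To make the equivalence between the single complex call and the two split calls concrete, the following Python sketch applies both representations to a short reference snippet and compares the resulting haplotypes. The five-base reference is hypothetical (only its central G mirrors the G at chr17:37880997 described above), offsets are relative to that snippet, and `apply_edits` is an illustrative helper rather than part of any caller.

```python
from typing import List, Tuple

# (0-based offset into the reference snippet, ref allele, alt allele)
Edit = Tuple[int, str, str]


def apply_edits(reference: str, edits: List[Edit]) -> str:
    """Apply non-overlapping edits to a reference snippet and return the haplotype."""
    haplotype = reference
    # Work from right to left so that offsets of earlier edits stay valid.
    for offset, ref, alt in sorted(edits, reverse=True):
        if haplotype[offset:offset + len(ref)] != ref:
            raise ValueError(f"reference allele mismatch at offset {offset}")
        haplotype = haplotype[:offset] + alt + haplotype[offset + len(ref):]
    return haplotype


# Hypothetical snippet; the flanking bases are invented for illustration.
reference = "CAGTC"

# One complex call: the G is deleted and TTAT is inserted (delinsTTAT).
complex_call = [(2, "G", "TTAT")]

# Two separate calls: a VCF-style TTA insertion anchored on the preceding base,
# plus a G>T substitution at the next position.
split_calls = [(1, "A", "ATTA"), (2, "G", "T")]

if __name__ == "__main__":
    print(apply_edits(reference, complex_call))
    print(apply_edits(reference, split_calls))
```

Both representations yield the same haplotype, which is why split calls must be reviewed together: annotating the insertion and the SNV independently would not reproduce the consequence of the single in-frame change.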

A complex multinucleotide variant at chr17:37880997 in the human reference genome hg19. (A) Integrative Genomics Viewer (IGV) screenshot of reads covering the variant. Each read is represented by a grey bar. The purple vertical lines on reads represent insertion events, and the different bases are coloured. For example, the red ‘T’ represents a single-nucleotide variant (SNV) from G to T. (B) Calls of VarDict, Mutect2, VarScan2 and Strelka2.
Figure 3

An effective strategy for improving the accuracy of variant detection is to integrate the variants called by multiple callers, for instance, adopting the union or intersection of results from various callers for SNVs and selecting indels that are called by at least two of three callers to ensure a higher accuracy [54]. Open-source callers based on this idea are already available, such as CAKE and GenomeVIP [4, 55].
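The integration strategy described above can be sketched in a few lines of Python. This is a simplified illustration, not the logic of CAKE or GenomeVIP: it assumes call sets are sets of (chromosome, position, ref, alt) tuples, treats any call whose ref and alt lengths differ as an indel, and uses hypothetical caller labels and variants.

```python
from collections import Counter
from typing import Dict, Set, Tuple

Variant = Tuple[str, int, str, str]


def is_indel(variant: Variant) -> bool:
    """Treat a call as an indel when ref and alt alleles differ in length."""
    _, _, ref, alt = variant
    return len(ref) != len(alt)


def consensus_calls(
    callsets: Dict[str, Set[Variant]],
    snv_mode: str = "union",
    min_indel_support: int = 2,
) -> Set[Variant]:
    """Combine callers: union or intersection for SNVs, minimum support for indels."""
    if snv_mode not in ("union", "intersection"):
        raise ValueError("snv_mode must be 'union' or 'intersection'")
    # Count how many callers report each variant.
    support = Counter(v for calls in callsets.values() for v in calls)
    min_snv_support = 1 if snv_mode == "union" else len(callsets)
    consensus = set()
    for variant, n_callers in support.items():
        threshold = min_indel_support if is_indel(variant) else min_snv_support
        if n_callers >= threshold:
            consensus.add(variant)
    return consensus


# Toy call sets from three hypothetical callers.
snv1, snv2, snv3 = ("chr1", 10, "A", "G"), ("chr1", 20, "C", "T"), ("chr1", 30, "G", "A")
indel1, indel2, indel3 = ("chr2", 5, "A", "AT"), ("chr2", 9, "GC", "G"), ("chr2", 15, "T", "TAA")
callsets = {
    "caller1": {snv1, indel1, indel2},
    "caller2": {snv1, snv2, indel1},
    "caller3": {snv1, snv3, indel3},
}

if __name__ == "__main__":
    print(sorted(consensus_calls(callsets, snv_mode="union")))
    print(sorted(consensus_calls(callsets, snv_mode="intersection")))
```

Exact tuple matching only works if all callers represent a variant identically, so in practice the VCF files would first need to be normalised (for example, left-aligned and split into one allele per record).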

Although filtering variants cannot recover false negatives, some strategies can reduce the number of false positives. Most variant callers have corresponding filtering methods; however, not all methods are suitable for every study. One example is GATK’s Variant Quality Score Recalibration (VQSR), which uses at least 5000 variants that coincide with a known variant set (HapMap) to train a Gaussian mixture model and then distinguishes true variants from false positives [56]. VQSR works well for variants from WGS or WES data (with a region size of at least 50 Mb); however, smaller datasets, such as those from gene-panel sequencing, may encounter difficulties or errors. Similarly, the GATK best practice for somatic SNV and indel calling introduces a ‘panel of normals’ (PON), which requires blood samples from at least 40 healthy young people. Furthermore, the most critical selection criterion for the ‘normals’ is that they are technically similar to the tumours (same exome or genome preparation methods and sequencing technology). The rigorous standards of running VQSR and building a PON pose significant challenges to research in small- to mid-sized laboratories. Therefore, although we recommend using multiple callers to detect variants, we strongly discourage filtering false positives using a crossover method.

In this study, we also performed mutual verification between variant callers for tumour samples without matched controls (Figure 2B). This revealed a dramatically high false-positive rate, indicating that variant detection in an individual tumour sample requires more work and exploration.

QC during variant calling focuses on the heterozygosity/homozygosity ratio, the transition/transversion (Ti/Tv) ratio and the mutation load, which help characterise variants at the sample level. For WGS data, the heterozygosity/homozygosity ratio for variants in Hardy-Weinberg equilibrium is 2.0, and a striking deviation from this value indicates errors such as unpaired tumour-normal samples [31]. The Ti/Tv ratio, which is approximately 3.0 for SNVs in exons and around 2.0 for non-exonic regions, is another indicator of overall SNV quality that differs by genomic region [57]. In addition, the number of mutations per million bases is an essential indicator of false-positive rates; Alexandrov et al. showed that the prevalence of somatic mutations varies from 0.001 to over 400 per million bases between and within cancer categories with adequate samples [58].
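The three sample-level metrics above are simple ratios, as the following Python sketch shows. It assumes SNVs are supplied as (ref, alt) base pairs and genotypes as diploid VCF-style GT strings such as `0/1`; the function names are hypothetical, and the reference values quoted above should be treated as rough expectations rather than hard thresholds.

```python
from typing import Iterable, Tuple

# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T) changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ti_tv_ratio(snvs: Iterable[Tuple[str, str]]) -> float:
    """Ratio of transitions to transversions for (ref, alt) single-base changes."""
    ti = tv = 0
    for ref, alt in snvs:
        if (ref.upper(), alt.upper()) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else float("inf")


def het_hom_ratio(genotypes: Iterable[str]) -> float:
    """Ratio of heterozygous to homozygous non-reference diploid genotypes."""
    het = hom_alt = 0
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        if len(alleles) != 2 or "." in alleles:
            continue  # skip missing or non-diploid genotypes
        if alleles[0] != alleles[1]:
            het += 1
        elif alleles[0] != "0":
            hom_alt += 1  # homozygous reference (0/0) is not counted
    return het / hom_alt if hom_alt else float("inf")


def mutations_per_mb(n_mutations: int, target_size_bp: int) -> float:
    """Mutation load expressed as mutations per million bases of target region."""
    return n_mutations * 1_000_000 / target_size_bp


if __name__ == "__main__":
    print(ti_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")]))
    print(het_hom_ratio(["0/1", "0|1", "1/1", "0/0", "./."]))
    # e.g. a 1.76 million base panel, as used for the samples in this review
    print(mutations_per_mb(12, 1_760_000))
```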

Currently, only a few analysis tools focus comprehensively on variant calling QC for different applications and purposes; however, many variant callers provide modules for checking certain aspects. For example, the ‘CalculateContamination’ module in GATK 4.0 estimates fractional contamination to inform downstream filtering, whereas SnpEff, QC3 and VariantQC in NGS-Bits are tools for profiling variant quality that provide the Ti/Tv and non-reference homozygosity ratios as well as cross-sample contamination checks [9, 59, 60]. Picard provides modules for processing VCF files, such as FixVcfHeader, MergeVcfs and SortVcf. In addition, NGSCheckMate can check sample consistency more quickly using VCF files. However, VCF files must first be generated by variant callers; if sample mismatches are only found at the calling stage, the time and storage spent on the preceding multi-step alignment and variant calling are wasted. Therefore, although FASTQ and VCF files can also be used to check sample consistency, using BAM files is a compromise strategy that saves time and finds mismatches as early as possible, avoiding wasted work.
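The idea behind a sample consistency check can be sketched as a correlation of variant allele fractions (VAFs) at shared polymorphic sites: samples from the same individual should correlate strongly, whereas unrelated samples should not. The Python example below is a simplified illustration under that assumption and does not reproduce the statistical model of NGSCheckMate or Somalier; the site identifiers and VAF values are invented.

```python
from math import sqrt
from typing import Dict


def vaf_correlation(sample_a: Dict[str, float], sample_b: Dict[str, float]) -> float:
    """Pearson correlation of VAFs over the sites covered in both samples."""
    shared = sorted(set(sample_a) & set(sample_b))
    if len(shared) < 2:
        raise ValueError("need at least two shared sites")
    xs = [sample_a[site] for site in shared]
    ys = [sample_b[site] for site in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("VAFs are constant in one sample; correlation undefined")
    return cov / sqrt(var_x * var_y)


# Hypothetical VAFs at six known SNP sites (site names are placeholders).
tumour = {"site1": 0.0, "site2": 0.5, "site3": 1.0, "site4": 0.45, "site5": 0.0, "site6": 1.0}
matched_normal = {"site1": 0.0, "site2": 0.5, "site3": 1.0, "site4": 0.5, "site5": 0.0, "site6": 1.0}
unrelated = {"site1": 1.0, "site2": 0.0, "site3": 0.5, "site4": 0.5, "site5": 1.0, "site6": 0.0}

if __name__ == "__main__":
    print(vaf_correlation(tumour, matched_normal))
    print(vaf_correlation(tumour, unrelated))
```

Any decision threshold on the correlation would need to be calibrated on real data, since copy-number changes and low depth in tumour samples shift VAFs away from the germline values.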

Variant annotation and visualisation

Variant annotation is the aggregation and summarisation of data relevant to a candidate genomic alteration and is a crucial step in cancer genome sequencing for the ultimate prediction of functional effects. Importantly, variant annotation characterises variants from multiple aspects, ranging from basic information (such as gene symbols and functional classification) to disease-specific predictions and drug sensitivity derived from databases, to pinpoint a small subset of potential disease-causing mutations for further interpretation.

Genetic variant databases of large-scale populations, such as 1000 Genomes, gnomAD (The Genome Aggregation Database), ESP6500 (NHLBI GO Exome Sequencing Project v. 6500) and ExAC (Exome Aggregation Consortium), aggregate population allele frequencies, allowing investigators to distinguish somatic mutations from germline variations [61–64]. gnomAD contains WGS samples from 15 496 individuals and WES samples from 123 136 individuals and is currently recognised as the most significant genetic variant research database. Databases associated with genetic diseases include ClinVar, HGMD (The Human Gene Mutation Database) and OMIM (Online Mendelian Inheritance in Man). ClinVar is a freely accessible archive of relationships between human genome variations and phenotypes with evidence from expert reviews [65–67], whereas HGMD collates all published gene lesions related to human inherited disease from approximately 250 journals. OMIM contains mutations identified in published materials from research worldwide. COSMIC (Catalogue of Somatic Mutations in Cancer) contains variants related to human cancer and is currently the most comprehensive expert-curated global dataset for exploring the effect of somatic mutations on cancer. In addition, prediction tools such as SIFT, PolyPhen2, GERP++, CADD and REVEL are used to evaluate the malignancy and pathogenicity of missense variants [68–72]. Table 2 summarises the commonly used databases and prediction tools that facilitate access to and communication about variation and disease.

Table 2

Databases and predictors used in annotations

| Databases or software | Name | Public | Samples or variants | Type | URL |
| --- | --- | --- | --- | --- | --- |
| Frequency database: for variant frequency in WGS | 1000 Genomes: The 1000 Genomes Project dataset [61] | Yes | ~84.40 million variants from 2504 samples | WGS | http://www.internationalgenome.org/data |
| | Kaviar: Known VARiants [73] | Yes | ~170.00 million variants from 77 781 samples | ~132 000 WGS, ~644 000 WES | http://db.systemsbiology.net/kaviar/ |
| | Haplotype Reference Consortium database [74] | Yes | ~40.00 million variants from 32 000 samples | Panels | http://www.haplotype-reference-consortium.org/ |
| | 69 Genomes Data: Allele frequency in 69 human subjects | Yes | 69 samples | WGS | https://www.completegenomics.com/public-data/69-genomes/ |
| | gnomAD: The Genome Aggregation Database [62] | Yes | 138 632 samples | 15 496 WGS | https://macarthurlab.org/2017/02/27/the-genome-aggregation-database-gnomad/ |
| Frequency database: for variant frequency in WES | gnomAD: The Genome Aggregation Database [62] | Yes | 138 632 samples | 123 136 WES | https://macarthurlab.org/2017/02/27/the-genome-aggregation-database-gnomad/ |
| | ESP6500: NHLBI GO Exome Sequencing Project v. 6500 [63] | Yes | 6503 samples | WES | http://evs.gs.washington.edu/EVS/ |
| | ExAC: Exome Aggregation Consortium dataset [64] | Yes | 60 706 samples | WES | http://exac.broadinstitute.org/ |
| Functional prediction tools: for functional variant prediction in WGS | (Software) GERP++: Genomic Evolutionary Rate Profiling [70] | Yes | ~9.00 million variants | WGS | http://mendel.stanford.edu/SidowLab/downloads/gerp/ |
| | (Software) CADD: Combined Annotation Dependent Depletion [71] | Yes | ~9.00 million variants | All | https://cadd.gs.washington.edu/ |
| | (Software) DANN: A deep learning approach for annotating the pathogenicity of genetic variants [75] | Yes | | All | https://dbpia.nl.go.kr/bioinformatics/article/31/5/761/2748191 |
| | (Software) Fathmm: Functional Analysis Through Hidden Markov Models [76] | Yes | ~9.00 million variants | Coding and noncoding | http://fathmm.biocompute.org.uk/ |
| | (Software) EIGEN: A spectral approach integrating functional genomic annotations for coding and noncoding variants [77] | Yes | ~9.00 million variants | Coding and noncoding | http://www.columbia.edu/~ii2135/eigen.html |
| | (Software) GWAVA: Eigen scores [78] | Yes | | Noncoding | https://www.sanger.ac.uk/science/tools/gwava |
| Functional prediction tools: for functional variant prediction in WES | dbNSFP: An annotation database for assembled non-synonymous SNPs [79] | Yes | | WES | https://sites.google.com/site/jpopgen/dbNSFP |
| Functional prediction tools: for functional splice variant prediction | scsnv: A likelihood score that a variant affects splicing [80] | Yes | | Splicing | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4267638/ |
| | SPIDEX: Splicing Index [81] | Yes | | Splicing | http://tools.genes.toronto.edu/ |
| Disease database or tools: for disease-specific variants | ClinVar: Clinical Variants [65] | Yes | ~0.43 million variants | Coding | https://www.clinicalgenome.org/data-sharing/clinvar/ |
| | COSMIC: The Catalogue of Somatic Mutations in Cancer [82] | Yes | ~3.00 million variants | Coding and noncoding | https://cancer.sanger.ac.uk/cosmic/ |
| | ICGC: International Cancer Genome Consortium | Yes | ~68.00 million variants | Coding | https://icgc.org/ |
| Disease database or tools: for variant identifiers | dbSNP: The NCBI database of genetic variation [83] | Yes | ~60.00 million variants | WGS | https://www.ncbi.nlm.nih.gov/snp/ |
| | OMIM: Online Mendelian Inheritance in Man [67] | No | ~0.03 million variants | WGS | https://omim.org/ |

Currently, there are four leading tools used to annotate DNA variants: Annovar, SnpEff, VEP (Variant Effect Predictor) and Oncotator (Table 3) [59, 84–86]. Annovar can flexibly annotate SNVs or SNPs, indels and CNVs according to continuously updated publicly available resources and user-contributed datasets. Also, Annovar annotates variants by gene-based, region-based and filter-based methods and can handle gene definitions from RefSeq, UCSC, Ensembl and GENCODE. Moreover, Annovar can accept variants in VCF as input and convert variant lists in VCF into a specific format using its internal scripts.

Table 3

Annotators for variant annotation

AnnotatorLanguageHuman reference genomeInput formatOutput formatPrepare input functionDownload database functionSelf-build databaseFiltering functionStatistics filteringVariant typeWeb-interface
VEPPerlGRCh38(HG38)/GRCh37(hg19)VCFVCFχχSNVs;indels;CNVs;SVsχ
OncotatorPythonGRCh37(hg19)VCF;TCGAMAF;MAFLITEVCF;MAFLITEχχχχSNVs;indels✓ (http://www.broadinstitute.org/oncotator)
SnpEffJavaGRCh38/GRCh37(hg19)VCFVCFχχSNVs;indelsχ
AnnovarPerlGRCh38/GRCh37(hg19)VCF; Annovar inputVCF; Annovar outputχSNPs; indels;CNVs✓ (http://asia.ensembl.org/Tools/VEP)

Meanwhile, Oncotator integrates data from 14 public resources to annotate SNVs and indels with data relevant to cancer studies and has been used in the Broad Institute’s Cancer Genome Analysis pipeline; thus, it is crucial for several large-scale projects conducted by TCGA, NHGRI and TARGET. Oncotator is also available as a Python module for local use and is accessible as a web service that can deal with multiple input formats, including VCF, TCGAMAF (mutation annotation format) and MAFLITE. However, due to its processing of the ‘INFO’ field, the MAF files generated from the VCF files of various callers (such as VarDict and GATK) display different column titles, column numbers and column order [87]. Thus, to ensure the consistency of the annotation results, variants in VCF need to be converted into the MAFLITE format, which requires additional custom scripts. Oncotator only supports the GRCh37 (hg19) human reference genome, and no update has been released for over 3 years.

SnpEff is a toolbox that performs genetic variant annotation and functional effect prediction, whereas SnpSift manipulates the variants annotated by SnpEff, thus filtering large genomic datasets and identifying the most significant variants. ClinEff is an innovative tool released as a professional version of the SnpEff and SnpSift suites and is more suitable for clinical uses and production operations [88, 89]. However, ClinEff suites only process VCF input files and produce VCF files with a new field, ‘ANN’, while requiring additional configuration before annotation which may require professional IT knowledge.

Lastly, VEP annotates SNPs, indels, CNVs and SVs (structural variants) by providing up to 32 variant classifications and the corresponding filtering or statistical scripts. VEP also reads both compressed (gzipped) and uncompressed input files of multiple formats, such as VCF and HGVS identifiers, and returns annotations in tab-delimited, VCF and JSON formats. Simultaneously, VEP generates an HTML file containing statistics on the annotation results, presenting a general report of command line parameters, running time and information about the number of variants, genes, transcripts and regulatory features overlapped by the input variants.

Due to their different algorithms, transcript sets and reporting strategies, annotation tools differ significantly in the transcript and genome features they report. For instance, VEP reports all possible transcripts, whereas Annovar only returns the transcripts with the most severe consequence according to its internal precedence, as demonstrated by McCarthy et al. [90]. We therefore recommend using VEP to annotate tumour-normal paired cancer genome variants because it offers a suite of filtering scripts and a statistical report, and then using Vcf2maf to convert variants from VCF to TCGA MAF [91], which is required by many downstream tools such as MuSiC2 for detecting significantly mutated genes, MutSigCV for identifying driver genes and Hotspot3D for identifying mutation hotspots in protein structures [92–94].
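The difference between reporting every transcript and reporting only the most severe consequence can be illustrated with a short Python sketch. The severity ranking below is a shortened, illustrative list of Sequence Ontology consequence terms and is not the full precedence table of VEP or Annovar; the transcript identifiers and the `most_severe` helper are hypothetical.

```python
from typing import Dict, List

# Illustrative ranking, most severe first (a subset chosen for this example only).
SEVERITY_ORDER = [
    "stop_gained",
    "frameshift_variant",
    "missense_variant",
    "synonymous_variant",
    "intron_variant",
]
RANK = {term: index for index, term in enumerate(SEVERITY_ORDER)}


def most_severe(annotations: List[Dict[str, str]]) -> Dict[str, str]:
    """Pick the per-transcript annotation with the most severe consequence.

    Unknown consequence terms are ranked below every listed term.
    Raises ValueError if the list of annotations is empty.
    """
    return min(annotations, key=lambda a: RANK.get(a["consequence"], len(SEVERITY_ORDER)))


# One variant annotated against three hypothetical transcripts of the same gene.
per_transcript = [
    {"transcript": "TX_A", "consequence": "intron_variant"},
    {"transcript": "TX_B", "consequence": "missense_variant"},
    {"transcript": "TX_C", "consequence": "synonymous_variant"},
]

if __name__ == "__main__":
    print(per_transcript)               # "report all transcripts" style output
    print(most_severe(per_transcript))  # "most severe only" style output
```

Collapsing to one record per variant simplifies downstream tables such as MAF, but the choice of ranking and transcript set determines which consequence is kept, which is one reason why annotators can disagree on the same variant.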

The visual inspection of aligned reads to validate and interpret candidate variants is an essential aspect of cancer genome sequencing. Visualisation tools display aligned reads with their mapping characteristics (e.g. mapping quality, depth, strand) and variants with annotation information from various databases in a user-friendly, intuitive and responsive manner. We have divided the most prevalent visualisation tools into web servers and stand-alone applications. Ensembl Genome Browser, JBrowse Genome Browser, UCSC Genome Browser and VEGA Genome Browser are web servers that display uploaded data in the context of online resources without the hassle of downloading and complex installation [95–98]. These servers accept user-defined tracks either as an uploaded file or as a remote URL; however, they require additional data handling, such as sorting and indexing files in a fixed format, that may be difficult for inexperienced practitioners. Owing to limitations such as network bandwidth, and to the cybersecurity and privacy issues raised by uploading large sequence data, web-based visualisation tools are better suited to temporarily viewing data in a small region of interest or to exploring published datasets in public databases. Conversely, Artemis, IGV (Integrative Genomics Viewer) and Savant (Sequence Annotation, Visualisation and Analysis Tool) are stand-alone visualisation tools that can flexibly display variants with interactive browsing and zooming features [99–101]. Although they do not need to access large-volume data remotely, these tools do require annotation resources to be downloaded and kept up to date to allow meaningful visualisation and validation. In addition, large-volume data such as WGS samples demand computers or clusters equipped with ample memory and storage. Thus, stand-alone visualisation tools are suitable for experienced analysts with an IT background who view sequence data in large-scale clinical cancer sequencing.
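Much of the additional data handling that track uploads require is coordinate sorting before indexing. The following is a minimal sketch, assuming BED-like records held as (chromosome, start, end) tuples and a hypothetical chromosome ordering (chr1 to chr22, then chrX, chrY, chrM); a given browser or indexer may expect a different order, for example a purely lexicographic one.

```python
def chrom_key(chrom):
    """Sort key: numeric chromosomes first, then X, Y, M/MT, then anything else by name."""
    name = chrom[3:] if chrom.startswith("chr") else chrom
    if name.isdigit():
        return (0, int(name), "")
    special = {"X": 1, "Y": 2, "M": 3, "MT": 3}
    if name in special:
        return (1, special[name], "")
    return (2, 0, name)


def sort_intervals(records):
    """Sort (chrom, start, end) records by chromosome, then start, then end."""
    return sorted(records, key=lambda r: (chrom_key(r[0]), r[1], r[2]))


# Hypothetical unsorted track records.
track = [
    ("chr10", 5, 9),
    ("chr2", 100, 200),
    ("chrX", 1, 2),
    ("chr2", 50, 60),
]

for chrom, start, end in sort_intervals(track):
    print(chrom, start, end, sep="\t")
```

The numeric key matters because a plain string sort would place chr10 before chr2; whichever order is chosen, it has to match what the subsequent indexing step and the browser expect.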

Conclusion and discussion

Translating complex cancer genomic data into clinical practice involves a series of large and complicated steps, including foundational analysis to detect variants, advanced interpretation to reveal molecular profiles and multi-omics approaches to obtain biological insights. We limited this review to foundational analysis, an interdisciplinary area of bioinformatics and computer science, and conducted a comprehensive and systematic investigation of the tools, strategies and published literature on cancer genome sequencing, emphasising the importance of QC in each step. Although we investigated and compared many tools for cancer genome sequencing, other aspects, such as sample extraction, library preparation and the sequencing process itself, are equally important. We also limited the study to the analysis of sequencing data from Illumina sequencers. QC should be applied in each step of cancer genome sequencing data analysis, i.e. raw data pre-processing, alignment, variant calling and annotation, and at different resolutions, from the single-sample to the cohort level (Figure 1). The recommendations we propose should serve as useful guidelines for both skilled and inexperienced practitioners in choosing the appropriate tools and optimal steps for specific applications, and we anticipate that they will promote the development of precision medicine.
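The two QC resolutions mentioned above can be illustrated with a short sketch. It assumes a hypothetical table of mean sequencing depth per sample, an arbitrary single-sample minimum and a median absolute deviation (MAD) rule for the cohort; these thresholds are illustrative assumptions rather than recommended values.

```python
from statistics import median


def below_threshold(metric_by_sample, minimum):
    """Single-sample QC: names of samples whose metric is below a fixed minimum."""
    return sorted(s for s, v in metric_by_sample.items() if v < minimum)


def cohort_outliers(metric_by_sample, n_mads=3.0):
    """Cohort-level QC: samples further than n_mads MADs from the cohort median."""
    if not metric_by_sample:
        return []
    values = list(metric_by_sample.values())
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # At least half of the samples equal the median; flag any that differ.
        return sorted(s for s, v in metric_by_sample.items() if v != med)
    return sorted(
        s for s, v in metric_by_sample.items() if abs(v - med) / mad > n_mads
    )


# Hypothetical mean depth per sample.
mean_depth = {"S1": 100, "S2": 104, "S3": 98, "S4": 102, "S5": 35}

print(below_threshold(mean_depth, minimum=30))
print(cohort_outliers(mean_depth))
```

In this toy cohort, sample S5 passes the fixed single-sample minimum but is flagged at the cohort level, which is the kind of deviation that only becomes visible when QC metrics are compared across samples.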

Open-source software, workflows and public resources have greatly boosted the mining of cancer genomes. A variety of tools are now available for each step of the analysis, and those that can handle a diverse range of genomic data formats (FASTQ, BAM, SAM, VCF, MAF) are attracting increasing attention and becoming more widely used in routine analysis. This is illustrated by some of the tools discussed in this review: NGSCheckMate validates sample identity from FASTQ, BAM and VCF input files and is effective for different data types (WES, WGS and RNA-seq), whereas VarDict detects a wide range of variant signals, including SNVs, indels, CNVs and SVs, with a single scan of aligned sequences. Other tools with similarly versatile features include NGS-Bits, MultiQC and IGV.

Several other aspects can potentially influence cancer genome sequencing. Firstly, the entire foundational analysis of genomic data is a coherent process in which the results of each step directly affect the accuracy of subsequent procedures. Tool developers should therefore document not only the functions and features of their own tool but also its compatibility with upstream tools. Secondly, test datasets (such as NA12878) sequenced from real samples, with various characteristics, ground truth and filtering standards, should be improved or released to provide uniform and objective evaluation criteria. Thirdly, large-scale studies should aim to eliminate the imbalance in population diversity in public databases: Landry et al. found significantly fewer studies in Asian (29%) and other populations (4%; e.g. African and Latin American) than in European populations (67%) [102]. Cancer genome sequencing relies on population-based resources to remove common genomic polymorphisms; therefore, population imbalance may adversely affect cancer investigations in underrepresented populations, and it is necessary to establish genomic polymorphism databases for specific populations.
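To illustrate why population-specific polymorphism databases matter, the sketch below filters candidate variants against a reference panel of allele frequencies. The variant identifiers, the two panels and the 1% cut-off are all hypothetical values chosen for the example; they are not taken from any real database.

```python
def remove_common_polymorphisms(variants, population_af, max_af=0.01):
    """Drop variants whose allele frequency in the panel exceeds max_af.

    Variants absent from the panel are kept, i.e. treated as unobserved.
    """
    return [v for v in variants if population_af.get(v, 0.0) <= max_af]


# Hypothetical candidate variants from a tumour sample.
candidates = ["var_a", "var_b", "var_c"]

# Hypothetical allele frequencies in two reference panels.
european_panel = {"var_a": 0.20, "var_b": 0.001}
local_panel = {"var_a": 0.18, "var_b": 0.15}

print(remove_common_polymorphisms(candidates, european_panel))
print(remove_common_polymorphisms(candidates, local_panel))
```

Here var_b is rare in the first panel but common in the second, so it is retained as a candidate when only the first panel is available and removed once frequencies from the matching population are used.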

Key Points
  • No single popular pipeline or software package for cancer genome sequencing data analysis suits all applications; tools should be selected based on specific research needs.

  • The local realignment step is critical to the pipeline, but whether it should be performed depends on the variant caller chosen for the next step.

  • Quality control should be conducted in each step using appropriate software to ensure the accuracy of the analysis.

Funding

National Natural Science Foundation of China (grant number 31771466); National Key R&D Program of China (grant numbers 2018YFB0203903, 2016YFC0503607, and 2016YFB0200300); Special Project of Informatization of the Chinese Academy of Sciences, China (grant number XXH13504-08).

Conflict of interest

The authors declare that no conflict of interest exists.

Xiaoyu He, Ph.D., Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences, is mainly engaged in cancer genomics research and construction of the Chinese cancer genome database.

Shanyu Chen, Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing 100190, China.

Ruilin Li, Computer Network Information Center, Chinese Academy of Sciences; Beijing 100190, China.

Xinyin Han, Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing 100190, China.

Zhipeng He, Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing 100190, China.

Danyang Yuan, Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing 100190, China.

Shuying Zhang, Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing 100190, China.

Xiaohong Duan, ChosenMed Technology (Beijing) Co., Ltd., Beijing 100176, China.

Beifang Niu, Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing 100190, China; ChosenMed Technology (Beijing) Co., Ltd., Beijing 100176, China.

References

1. Challis D, Yu J, Evani US, et al. An integrative variant analysis suite for whole exome next-generation sequencing data. BMC Bioinform 2012;13:8.

2. Team G. Getting started with GATK4.

3. DePristo MA, Banks E, Poplin R, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 2011;43:491–8.

4. Mashl RJ, Scott AD, Huang K-L, et al. GenomeVIP: a cloud platform for genomic variant discovery and interpretation. Genome Res 2017;27:1450–9.

5. Yakneen S, Waszak SM, Yakneen S, et al. Butler enables rapid cloud-based analysis of thousands of human genomes. Nat Biotechnol 2020;38:288–92.

6. Zhao S, Prenger K, Smith L, et al. Rainbow: a tool for large-scale whole-genome sequencing data analysis using cloud computing. BMC Genomics 2013;14:425.

7. Elshazly H, Souilmi Y, Tonellato PJ, et al. MC-GenomeKey: a multicloud system for the detection and annotation of genomic variants. BMC Bioinform 2017;18:49.

8. Andrews S. Babraham bioinformatics - FastQC a quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ (8 January 2019, date last accessed).

9. Schroeder CM, Hilke FJ, Löffler MW, et al. A comprehensive quality control workflow for paired tumor-normal NGS experiments. Bioinformatics 2017;33:1721–2.

10. Ewels P, Magnusson M, Lundin S, et al. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics 2016;32:3047–8.

11. Gordon A, Hannon GJ. Fastx-toolkit. 2010.

12. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.J 2011;17:10–2.

13. Jiang H, Lei R, Ding S-W, et al. Skewer: a fast and accurate adapter trimmer for next-generation sequencing paired-end reads. BMC Bioinform 2014;15:182.

14. Chen Y, Chen Y, Shi C, et al. SOAPnuke: a MapReduce acceleration-supported software for integrated quality control and preprocessing of high-throughput sequencing data. GigaScience 2017;7:1–6.

15. Krueger F. Trim galore!

16. Li YL, Weng JC, Hsiao CC, et al. PEAT: an intelligent and efficient paired-end sequencing adapter trimming algorithm. BMC Bioinform 2015;16:2.

17. Ewing B, Hillier L, Wendl MC, et al. Base-calling of automated sequencer traces using Phred. I. Accuracy assessment. Genome Res 1998;8:175–85.

18. Schmieder R, Edwards R. Quality control and preprocessing of metagenomic datasets. Bioinformatics 2011;27:863–4.

19. Patel RK, Jain M. NGS QC toolkit: a toolkit for quality control of next generation sequencing data. PLoS One 2012;7:e30619.

20. Zhou Y, Chen Y, Chen S, et al. Fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 2018;34:i884–90.

21. Chen S, Huang T, Zhou Y, et al. AfterQC: automatic filtering, trimming, error removing and quality control for fastq data. BMC Bioinform 2017;18:80–0.

22. Fonseca NA, Rung J, Brazma A, et al. Tools for mapping high-throughput sequencing data. Bioinformatics 2012;28:3169–77.

23. Li H, Handsaker B, Wysoker A, et al. The sequence alignment/map format and SAMtools. Bioinformatics 2009;25:2078–9.

24. Langmead B, Salzberg SL. Fast gapped-read alignment with bowtie 2. Nat Methods 2012;9:357.

25. Langmead B, Trapnell C, Pop M, et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009;10:25.

26. Mose LE, Wilkerson MD, Hayes DN, et al. ABRA: improved coding indel detection via assembly-based realignment. Bioinformatics 2014;30:2813–5.

27. Homer N, Nelson SF. Improved variant discovery through local re-alignment of short-read next-generation sequencing data using SRMA. Genome Biol 2010;11:R99.

28. Shlee. Changing workflows around calling SNPs and indels.

29. Lai Z, Markovets A, Ahdesmaki M, et al. VarDict: a novel and versatile variant caller for next-generation sequencing in cancer research. Nucleic Acids Res 2016;44:e108–8.

30. Li H. Improving SNP discovery by base alignment quality. Bioinformatics 2011;27:1157–8.

31. Guo Y, Ye F, Sheng Q, et al. Three-stage quality control strategies for DNA re-sequencing data. Brief Bioinform 2014;15:879–89.

32. Rabbani B, Tekin M, Mahdieh N. The promise of whole-exome sequencing in medical genetics. J Hum Genet 2014;59:5–15.

33. Clark MJ, Chen R, Lam HYK, et al. Performance comparison of exome DNA sequencing technologies. Nat Biotechnol 2011;29:908–14.

34. Shiquan. Bamdst: a BAM depth stat tool. https://github.com/shiquan/bamdst (13 February 2020, date last accessed).

35. Park WY, Lee S, Ouellette S, et al. NGSCheckMate: software for validating sample identity in next-generation sequencing studies within and across data types. Nucleic Acids Res 2017;45:e103–3.

36. Wang PPS, Parker WT, Branford S, et al. BAM-matcher: a tool for rapid NGS sample matching. Bioinformatics 2016;32:2699–701.

37. Pedersen BS, Bhetariya PJ, Marth G, et al. Somalier: rapid relatedness estimation for cancer and germline studies using efficient genome sketches. bioRxiv 2019;839944.

38. Pedersen BS, Quinlan AR. Who's who? Detecting and resolving sample anomalies in human DNA sequencing studies with Peddy. Am J Hum Genet 2017;100:406–13.

39. Fasterius E, Al-Khalili SC. seqCAT: a bioconductor R-package for variant analysis of high throughput sequencing data. F1000 Res 2019;7:1466.

40. Schröder J, Corbin V, Papenfuss AT. HYSYS: have you swapped your samples? Bioinformatics 2016;33:596–8.

41. Conesa A, García-Alcalde F, Dopazo J, et al. Qualimap: evaluating next-generation sequencing alignment data. Bioinformatics 2012;28:2678–9.

42. Koboldt DC, Zhang Q, Larson DE, et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res 2012;22:568–76.

43. Kim S, Scheffler K, Halpern AL, et al. Strelka2: fast and accurate variant calling for clinical sequencing applications. Nat Methods 2018;15:591–4.

44. Ye K, Schulz MH, Long Q, et al. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 2009;25:2865–71.

45. Larson DE, Harris CC, Chen K, et al. SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics 2011;28:311–7.

46. Mayrhofer M, DiLorenzo S, Isaksson A. Patchwork: allele-specific copy number analysis of whole-genome sequenced tumor tissue. Genome Biol 2013;14:R24.

47. Talevich E, Shain AH, Botton T, et al. CNVkit: genome-wide copy number detection and visualization from targeted DNA sequencing. PLoS Comput Biol 2016;12:e1004873–3.

48. Abyzov A, Urban AE, Snyder M, et al. CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res 2011;21:974–84.

49. Yang L, Luquette LJ, Gehlenborg N, et al. Diverse mechanisms of somatic structural variations in human cancer genomes. Cell 2013;153:919–29.

50. Rausch T, Zichner T, Schlattl A, et al. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics 2012;28:i333–9.

51. Sandmann S, de Graaf AO, Karimi M, et al. Evaluating variant calling tools for non-matched next-generation sequencing data. Sci Rep 2017;7:43169.

52. Krøigård AB, Thomassen M, Lænkholm A-V, et al. Evaluation of nine somatic variant callers for detection of somatic mutations in exome and targeted deep sequencing data. PLoS One 2016;11:e0151664.

53. Xu H, DiCarlo J, Satya RV, et al. Comparison of somatic mutation calling methods in amplicon and whole exome sequence data. BMC Genomics 2014;15:244.

54. Weinhold N, Jacobsen A, Schultz N, et al. Genome-wide analysis of noncoding regulatory mutations in cancer. Nat Genet 2014;46:1160–5.

55. Rashid M, Robles-Espinoza CD, Rust AG, et al. Cake: a bioinformatics pipeline for the integrated analysis of somatic variants in cancer genomes. Bioinformatics 2013;29:2208–10.

56. Rpoplin. Variant quality score recalibration (VQSR).

57. Wang J, Raskin L, Samuels DC, et al. Genome measures used for quality control are dependent on gene function and ancestry. Bioinformatics 2015;31:318–23.

58. Alexandrov LB, Nik-Zainal S, Wedge DC, et al. Signatures of mutational processes in human cancer. Nature 2013;500:415.

59. Cingolani P, Platts A, Wang LL, et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin) 2012;6:80–92.

60. Guo Y, Zhao S, Sheng Q, et al. Multi-perspective quality control of Illumina exome sequencing data using QC3. Genomics 2014;103:323–8.

61. 1000 Genomes Project Consortium, Auton A, Abecasis GR, et al. A global reference for human genetic variation. Nature 2015;526:68–74.

62. Karczewski K. The genome aggregation database (gnomAD). http://gnomad.broadinstitute.org/ (29 March 2020, date last accessed).

63. Auer PL, Johnsen JM, Johnson AD, et al. Imputation of exome sequence variants into population-based samples and blood-cell-trait-associated loci in African Americans: NHLBI GO exome sequencing project. Am J Hum Genet 2012;91:794–808.

64. Karczewski KJ, Weisburd B, Thomas B, et al. The ExAC browser: displaying reference data information from over 60 000 exomes. Nucleic Acids Res 2017;45:D840–5.

65. Landrum MJ, Lee JM, Riley GR, et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res 2014;42:D980–5.

66. Stenson PD, Mort M, Ball EV, et al. The human gene mutation database: building a comprehensive mutation repository for clinical and molecular genetics, diagnostic testing and personalized genomic medicine. Hum Genet 2014;133:1–9.

67. Hamosh A, Scott AF, Amberger JS, et al. Online mendelian inheritance in man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res 2005;33:D514–7.

68. González-Pérez A, López-Bigas N. Improving the assessment of the outcome of nonsynonymous SNVs with a consensus deleteriousness score, Condel. Am J Hum Genet 2011;88:440–9.

69. Ng PC, Henikoff S. SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res 2003;31:3812–4.

70. Davydov EV, Goode DL, Sirota M, et al. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput Biol 2010;6:e1001025.

71. Rentzsch P, Witten D, Cooper GM, et al. CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res 2018;47:D886–94.

72. Ioannidis N, Rothstein J, Pejaver V, et al. REVEL: an ensemble method for predicting the pathogenicity of rare missense variants. Am J Hum Genet 2016;99:877–85.

73. Glusman G, Caballero J, Mauldin DE, et al. Kaviar: an accessible system for testing SNV novelty. Bioinformatics 2011;27:3216–7.

74. Iglesias AI, van der Lee SJ, Bonnemaijer PWM, et al. Haplotype reference consortium panel: practical implications of imputations with large reference panels. Hum Mutat 2017;38:1025–32.

75. Quang D, Chen Y, Xie X. DANN: a deep learning approach for annotating the pathogenicity of genetic variants. Bioinformatics 2015;31:761–3.

76. Shihab HA, Gough J, Mort M, et al. Ranking non-synonymous single nucleotide polymorphisms based on disease concepts. Hum Genomics 2014;8:11.

77. Ionita-Laza I, McCallum K, Xu B, et al. A spectral approach integrating functional genomic annotations for coding and noncoding variants. Nat Genet 2016;48:214–20.

78. Ritchie GRS, Dunham I, Zeggini E, et al. Functional annotation of noncoding sequence variants. Nat Methods 2014;11:294–6.

79. Liu X, Jian X, Boerwinkle E. dbNSFP: a lightweight database of human nonsynonymous SNPs and their functional predictions. Hum Mutat 2011;32:894–9.

80. Jian X, Boerwinkle E, Liu X. In silico prediction of splice-altering single nucleotide variants in the human genome. Nucleic Acids Res 2014;42:13534–44.

81. Hsiao Y-HE, Bahn JH, Lin X, et al. Alternative splicing modulated by genetic variants demonstrates accelerated evolution regulated by highly conserved proteins. Genome Res 2016;26:440–50.

82. Forbes SA, Bhamra G, Bamford S, et al. The catalogue of somatic mutations in cancer (COSMIC). Curr Protoc Hum Genet 2008;57:10.11.11–26.

83. Sherry ST, Ward M-H, Kholodov M, et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 2001;29:308–11.

84. Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res 2010;38:e164–4.

85. McLaren W, Gil L, Hunt SE, et al. The ensembl variant effect predictor. Genome Biol 2016;17:122.

86. Ramos AH, Lichtenstein L, Gupta M, et al. Oncotator: cancer variant annotation tool. Hum Mutat 2015;36:E2423–9.

87. Documentation NG. GDC MAF format v1.0.0.

88. Ruden D, Cingolani P, Patel V, et al. Using Drosophila melanogaster as a model for Genotoxic chemical mutational studies with a new program, SnpSift. Front Genet 2012;3:35.

89. DnaMiner. ClinEff. http://www.dnaminer.com/ (1 April 2020, date last accessed).

90. McCarthy DJ, Humburg P, Kanapin A, et al. Choice of transcripts and software has a large effect on variant annotation. Genome Med 2014;6:26.

91. Kettering MS. vcf2maf. https://github.com/mskcc/vcf2maf (1 April 2020, date last accessed).

92. Lawrence MS, Stojanov P, Polak P, et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 2013;499:214–8.

93. Niu B, Scott AD, Sengupta S, et al. Protein-structure-guided discovery of functional mutations across 19 cancer types. Nat Genet 2016;48:827–37.

94. Lab LDs. MuSiC: mutational significance in cancer (cancer mutation analysis) version 2. https://github.com/ding-lab/MuSiC2 (14 July 2019, date last accessed).

95. Stalker J, Gibbins B, Meidl P, et al. The ensembl web site: mechanics of a genome browser. Genome Res 2004;14:951–5.

96. Karolchik D, Hinrichs AS, Kent WJ. The UCSC genome browser. Curr Protoc Bioinformatics 2009;Chapter 1:Unit1.4–1.4.

97. Loveland J. VEGA, the genome browser with a difference. Brief Bioinform 2005;6:189–93.

98. Skinner ME, Uzilov AV, Stein LD, et al. JBrowse: a next-generation genome browser. Genome Res 2009;19:1630–8.

99. Rutherford K, Parkhill J, Crook J, et al. Artemis: sequence visualization and annotation. Bioinformatics 2000;16:944–5.

100. Robinson JT, Thorvaldsdóttir H, Winckler W, et al. Integrative genomics viewer. Nat Biotechnol 2011;29:24–6.

101. Fiume M, Williams V, Brook A, et al. Savant: genome browser for high-throughput sequencing data. Bioinformatics 2010;26:1938–44.

102. Landry LG, Ali N, Williams DR, et al. Lack of diversity in genomic databases is a barrier to translating precision medicine research into practice. Health Aff 2018;37:780–5.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Supplementary data