A comprehensive review of deep learning-based variant calling methods
Ren Junjun, Zhang Zhengqian, Wu Ying, Wang Jialiang, Liu Yongzhuang
Briefings in Functional Genomics, Volume 23, Issue 4, July 2024, Pages 303–313, https://doi.org/10.1093/bfgp/elae003
Abstract
Genome sequencing data have become increasingly important in the field of personalized medicine and diagnosis. However, accurately detecting genomic variations remains a challenging task. Traditional variation detection methods rely on manual inspection or predefined rules, which can be time-consuming and prone to errors. Consequently, deep learning–based approaches for variation detection have gained attention due to their ability to automatically learn genomic features that distinguish true variants from sequencing artifacts. In this review, we discuss recent advancements in deep learning–based algorithms for detecting small variations and structural variations in genomic data, as well as their advantages and limitations.
INTRODUCTION
Genetic variants can be classified into three main categories based on their size: single-nucleotide variants (SNVs) [1], short insertions and deletions (Indels ≤ 50 bp) and structural variants (SVs > 50 bp) [2], which can include deletion, insertion, duplication, inversion and translocation mutations [3]. Deletions and duplications of genomic fragments are referred to as copy number variations (CNVs) [4, 5]. Moreover, complex structural variants (CSVs) can occur through a combination of simple SV events [6, 7]. Variant calling refers to identifying nucleotide differences in an individual's genome relative to a reference sequence, which is critical for understanding human phenotypic diversity and human diseases such as cancer [8–14]. It plays a vital role in both research and clinical applications of human genome sequencing [15, 16]. Although there have been significant improvements in sequencing technology, accurately detecting genetic variation from billions of short and noisy sequence reads remains a challenging task [17–21].
When detecting SNVs and short Indels, the conventional approach is to identify non-reference bases in the collection of reads covering each position. Probabilistic modeling plays a crucial role in inferring the underlying genotype and in estimating the probability that an observed difference is a true variant rather than a sequencing artifact.
When detecting SVs, there are two primary categories of calling methods. De novo assembly–based methods assemble the original reads into longer sequences and compare them with a reference genome [22–24]. These methods are theoretically capable of detecting all types of variation and are less influenced by the reference sequence. However, accurately assembling a genome sequence can be quite challenging, especially when dealing with heterogeneous sequences. On the other hand, read alignment–based methods detect variations by directly aligning short paired reads or long reads to a reference genome. SV detection from short-read whole genome sequencing (WGS) data typically relies on read depth, discordant read pair and split read signals [3, 25]. However, accurate variant calling in whole exome sequencing (WES) data has been largely limited by technical issues, such as a high error rate [26–29]. Long reads can effectively span regions of high repetition or low complexity, thereby improving alignment quality and enabling the detection of more SVs than short reads [25, 30–35]. The two main single-molecule sequencing platforms, Pacific Biosciences (PacBio) [36] and Oxford Nanopore Technologies (ONT) [37], have significantly enhanced the performance of various genome applications, particularly genome assembly and variant calling [38–43].
Despite the advances in sequencing technology and the significant reduction in sequencing costs, enabling high coverage and a significant decrease in sequencing error rates, accurately detecting all variations in the human genome is still a challenge due to factors such as the complexity of the genome itself [44–47]. To address this, various methods are utilized in large-scale genome studies and integrated into a unified identification set [6, 48]. However, false-positive variant calling remains an issue [25], and heuristic filters and manual examination using software programs are commonly used [49, 50]. Nevertheless, these methods can be time-consuming and difficult to optimize for different sequencing datasets. An effective model-based variation detection method is urgently needed.
Machine learning techniques typically treat variant calling as a classification task, which involves calling and filtering genomic variations, and use supervised learning to develop models that can predict the presence or absence of variants. For instance, forests [51] and SV2 [52] utilize read comparisons to generate features and employ random forest models [53] and support vector machines [54], respectively, to detect SVs.
Deep learning, a machine learning technique that has gained popularity, is being used in diverse fields such as image recognition [55], language translation [56], gaming [57, 58] and life sciences [59–62]. In genomics, deep learning models have shown promise in accurately calling genetic variants, surpassing traditional methods. The introduction of DeepVariant [63], the first deep learning–based variant calling method, marked a shift toward deep learning approaches in contrast to traditional statistical methods. Deep learning methods, led by DeepVariant, have dominated short-read variant calling and have also made progress in long-read variant calling, overcoming challenges posed by high base error rates. Overall, advancements in sequencing technology have greatly enhanced the detection of genetic variations, opening up new possibilities in genomics research and clinical applications.
The present review aims to provide a comprehensive overview of deep learning–based approaches for variant calling, elucidating and comparing them in detail. Deep learning approaches are particularly noteworthy because they aim to reduce expert input and foster the increasing automation of processes. This review is divided into two primary sections, focusing firstly on small variations and secondly on structural variations. It covers key topics including tensor coding, training datasets and neural network (NN) architectures employed in variant calling. Each section of the review comprises specific research examples.
DEEP LEARNING OF GENOME VARIANT CALLING
In this paper, we provide a comprehensive review of variation detection techniques that are based on deep learning models. To better understand these methods, we present a general workflow for variation detection, as illustrated in Figure 1. Some tools are standalone, integrating candidate generation and deep learning in a single pipeline, while others must be combined with upstream variant calling software that performs the data preprocessing.

Figure 1. General workflow of variant calling methods based on deep learning. The upper part shows the input and preprocessing stage; the lower part shows the deep learning stage. First, reads are aligned to the reference genome to obtain candidate variants. The candidate variants are then encoded and fed into a trained neural network to produce high-confidence variant calls.
SNV and Indel calling
In this section, we introduce several classic deep learning methods for SNV and Indel detection. We summarize these methods, as shown in Table 1.
Table 1. Summary of deep learning-based methods for SNV and Indel calling
Model | Variant type | Fragment size | Region | Tensor encoding | Training set | Neural network | Advantage & limitation | Source code | Year |
---|---|---|---|---|---|---|---|---|---|
DeepVariant | SNV, INDEL | Short read | WGS, WES | Pileup image | Public | CNN | High accuracy, robust reliability; considerable computational resources | https://github.com/google/deepvariant/ | 2018 |
CNNScoreVariants | SNV, INDEL | Short read | WGS, WES | Pileup image | Public | CNN | Improved coding technique; enhanced predictive accuracy; coding redundancy | https://github.com/broadinstitute/gatk | 2020 |
Clairvoyante | SNV, INDEL | Long read (PacBio, ONT) | WGS | Pileup image | Public | CNN | Few parameters; outstanding performance in long-read technology; cannot identify multi-allelic variants or Indels ≥4 bases; not considering the base quality | https://github.com/aquaskyline/Clairvoyante | 2019 |
Clair | SNV, INDEL | Long read (PacBio, ONT) | WGS | Pileup image | Public | Bi-LSTM | Considerable improvements in precision, recall and speed; Indel calling of Nanopore data needs to be improved | https://github.com/HKUBAL/Clair | 2020 |
NanoCaller | SNV, INDEL | Long read (PacBio, ONT) | WGS | Pileup image | Public | CNN | Combine long-range haplotype information; few parameters; incorrect alignments in low-complexity regions; potential inaccuracy in detecting Indels in nucleotide repeats of Nanopore data | https://github.com/WGLab/NanoCaller | 2021 |
PEPPER-Margin-DeepVariant | SNV, INDEL | Long read (PacBio) | WGS | Full-alignment image | Public | CNN+ RNN | Haplotype-aware; encode more features; advanced variant calling results on Nanopore data; low Indel calling accuracy on Nanopore data | PEPPER: https://github.com/kishwarshafin/pepper Margin: https://github.com/UCSC-nanopore-cgl/margin DeepVariant: https://github.com/google/deepvariant | 2021 |
Clair3 | SNV, INDEL | Long read (ONT) | WGS | Full-alignment image and pileup image | Public | CNN + Bi-LSTM | Combine pileup-based and full alignment variant calling; fast runtime and excellent performance | https://github.com/HKUBAL/Clair3 | 2022 |
NanoSNP | SNV | Long read (ONT) | WGS | Pileup image and haplotype image | Public | Bi-LSTM | Combine long-range haplotype feature and short-range pileup feature; best performance on Nanopore data; cannot identify SNPs with short reads | https://github.com/huangnengCSU/NanoSNP.git | 2023 |
Tensor encoding
Variant calling methods such as DeepVariant [63] treat the task as an image classification problem via tensor encoding. Sequencing data are turned into an image and analyzed to identify genetic variations. DeepVariant learns the relationship between read pileup images and true genotype calls for accurate variant identification. The process starts with identifying candidate SNPs and Indels, processing mapped reads with a local read assembly procedure based on a De Bruijn graph [64] and selecting the best haplotypes with a hidden Markov model (HMM) [65]. A Smith–Waterman-like algorithm [66] is used for read realignment, and only high-quality reads are considered for variant calling. Candidate sites reaching the threshold are encoded as three-channel RGB pileup images, with the first row representing the reference sequence and the remaining rows representing reads, resulting in one image per candidate site.
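To make the pileup-image idea concrete, the sketch below encodes a small window of aligned reads as an image-like array, with the reference in the first row and one read per subsequent row. The channel layout, color values, window size and read cap are illustrative simplifications, not DeepVariant's actual encoding.

```python
import numpy as np

# Hypothetical color map: one base value per nucleotide (values are illustrative).
BASE_COLOR = {"A": 250, "C": 30, "G": 180, "T": 100, "N": 0}

def encode_pileup(reference, reads, window=15, max_reads=20):
    """Encode a candidate site as a (max_reads + 1) x window x 3 uint8 image.

    reference : str of length `window`, centered on the candidate site
    reads     : list of (sequence, base_qualities, strand) tuples, each aligned
                to the same window ('-' marks positions a read does not cover)
    """
    img = np.zeros((max_reads + 1, window, 3), dtype=np.uint8)

    # Row 0: the reference sequence, encoded in the first channel only.
    for col, base in enumerate(reference):
        img[0, col, 0] = BASE_COLOR.get(base, 0)

    # Remaining rows: one read per row.
    for row, (seq, quals, strand) in enumerate(reads[:max_reads], start=1):
        for col, base in enumerate(seq):
            if base == "-":
                continue  # position not covered by this read
            img[row, col, 0] = BASE_COLOR.get(base, 0)      # base identity
            img[row, col, 1] = min(quals[col], 60) * 4      # base quality
            img[row, col, 2] = 255 if strand == "+" else 70 # strand
    return img

# Toy example: one candidate site covered by two reads, one carrying a mismatch.
ref = "ACGTACGTACGTACG"
reads = [("ACGTACGTACGTACG", [30] * 15, "+"),
         ("ACGTACCTACGTACG", [25] * 15, "-")]
print(encode_pileup(ref, reads).shape)  # (21, 15, 3)
```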
CNNScoreVariants [67] retains both read and reference sequence information, adding mapping quality and read flags. Unlike DeepVariant, it encodes base qualities into the read tensor's base channels, treating them as a confidence level for each base call. It uses one-hot and p-hot encoding for the reference and read tensors, respectively. The reference tensor is a two-dimensional (2D) encoding of reference bases centered on the variant, and the read tensor is a three-dimensional tensor spanning different genomic sites in width and different read pileups in height. The reference tensor's channels are the four DNA bases, while the read tensor's first four channels encode base quality and the remaining five channels encode read flags representing the strand, pairing and mapping quality.
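The contrast between one-hot reference encoding and quality-aware "p-hot" read encoding can be illustrated as follows; this is a simplified interpretation in which each base call's Phred quality is converted to an error probability that is spread over the non-called bases, and it does not reproduce the exact GATK CNNScoreVariants implementation.

```python
import numpy as np

BASES = "ACGT"

def one_hot_reference(ref_seq):
    """One-hot encode a reference window: shape (len(ref_seq), 4)."""
    out = np.zeros((len(ref_seq), 4), dtype=np.float32)
    for i, b in enumerate(ref_seq):
        if b in BASES:
            out[i, BASES.index(b)] = 1.0
    return out

def p_hot_read(read_seq, phred_quals):
    """'p-hot' encode a read: the called base gets probability 1 - error,
    and the error probability is spread evenly over the other three bases."""
    out = np.zeros((len(read_seq), 4), dtype=np.float32)
    for i, (b, q) in enumerate(zip(read_seq, phred_quals)):
        if b not in BASES:
            continue
        err = 10 ** (-q / 10.0)             # Phred quality -> error probability
        out[i, :] = err / 3.0               # share error mass among other bases
        out[i, BASES.index(b)] = 1.0 - err  # confidence in the called base
    return out

print(one_hot_reference("ACGT"))
print(p_hot_read("ACGT", [30, 10, 40, 20]))
```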
Clairvoyante [68], specifically designed for SNP and Indel calls in single-molecule sequencing data, uses a three-dimensional tensor to encode information about the read and reference sequence. The tensor’s dimensions represent the site, count of the four bases on the reads and four different counting methods. In the third dimension, four different counting methods are used to generate four tensors: (1) for the reference sequence and supporting reads, (2) for inserted sequences, (3) for deleted base pairs and (4) for alternative alleles.
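A minimal sketch of a position × base × counting-method count tensor in the spirit of Clairvoyante is given below; the window size, the simplified per-column input format and the ordering of the four counting planes are assumptions made for illustration.

```python
import numpy as np

BASES = "ACGT"

def clairvoyante_like_tensor(columns, window=33):
    """Build a (window, 4, 4) count tensor for one candidate site.

    `columns` is a list of length `window`; each element is a dict with keys
    'ref_support', 'insertion', 'deletion', 'alt' mapping to lists of observed
    bases in that pileup column (a simplified stand-in for real alignments).
    """
    tensor = np.zeros((window, 4, 4), dtype=np.float32)
    plane = {"ref_support": 0, "insertion": 1, "deletion": 2, "alt": 3}
    for pos, col in enumerate(columns):
        for kind, bases in col.items():
            for b in bases:
                if b in BASES:
                    tensor[pos, BASES.index(b), plane[kind]] += 1.0
    return tensor

# Toy column: 8 reads support the reference 'A', 2 reads show an alternative 'G'.
cols = [{"ref_support": ["A"] * 8, "insertion": [], "deletion": [], "alt": ["G", "G"]}
        for _ in range(33)]
print(clairvoyante_like_tensor(cols).shape)  # (33, 4, 4)
```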
Clair [69] is an improved version of Clairvoyante that introduces four new deep learning tasks to address Clairvoyante's limitations, including multi-allelic variant calling and long Indel calling. Clair employs an encoding method similar to Clairvoyante's, but the second dimension of its three-dimensional tensor is twice as large, representing positive- and negative-strand counts for the four bases.
NanoCaller [70] is designed for long-read sequencing data, using long-range haplotype information and phased reads to improve variant calling accuracy. Unlike other tools such as DeepVariant, Clairvoyante and Clair, it only considers heterozygous SNP sites far from the candidate site in the pileup images. The read and reference sequence information of SNP candidate sites is encoded into a three-dimensional tensor covering alternative alleles, candidate sites and reference sequence bases. The pileup image represents different bases at the candidate site using five channels, with the fifth channel representing the reference sequence. For Indel candidate sites, the pileup image is also encoded as a three-dimensional tensor, combining matrices of all reads, reads in one phase and reads in the other phase at the candidate site. The three dimensions denote bases or deletions, pileup columns of the realigned sequences and the two matrices.
PEPPER-Margin-DeepVariant [71] and NanoCaller both use a haplotype-aware strategy for variant calling in long-read sequencing data. PEPPER-SNP, a submodel of PEPPER, calls SNPs using tensor encoding and represents each genomic site with 10 features. Different bases are color-coded, with each row and column representing a feature and reference genome site, respectively. Observations are coded as weights, shown as the alpha of each base. Another submodule, PEPPER-HP, considers SNVs and Indels as candidate variants and generates haplotype-specific likelihoods for each candidate variant. Similar to PEPPER-SNP, PEPPER-HP uses an encoding in which each column represents a reference position with two values, indicating the reference sequence site and the insertion alleles targeted at that site.
Clairvoyante, Clair and NanoCaller are based on pileup, which has an advantage in terms of time efficiency. PEPPER-Margin-DeepVariant is based on full-alignment variant calling, which provides the highest precision and recall. Clair3 [72], as the successor of Clair, combines the advantages of both methods by using full alignment for difficult variant candidates and pileup calling for the majority of candidates, resulting in fast runtime and excellent performance. Clair3’s input includes a 2D tensor for pileup and a three-dimensional tensor for full alignment, encoding genome site, features and various information related to the variant calling process.
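The division of labor between the pileup and full-alignment callers can be summarized by the sketch below. The `Candidate`/`Call` structures, the model callables and the quality threshold are placeholders for illustration, not Clair3's actual interface.

```python
from collections import namedtuple

Candidate = namedtuple("Candidate", "pileup_tensor full_alignment_tensor")
Call = namedtuple("Call", "genotype quality")

def call_variants_hybrid(candidates, pileup_model, full_alignment_model,
                         quality_threshold=0.95):
    """Two-stage calling in the spirit of Clair3: fast pileup calls first,
    full-alignment re-calling only for low-confidence candidates."""
    final_calls = []
    for site in candidates:
        call = pileup_model(site.pileup_tensor)              # fast path
        if call.quality < quality_threshold:                 # difficult candidate
            call = full_alignment_model(site.full_alignment_tensor)
        final_calls.append(call)
    return final_calls

# Toy models: the pileup model is unsure, the full-alignment model is confident.
pileup_model = lambda t: Call("0/1", 0.60)
full_model = lambda t: Call("1/1", 0.99)
print(call_variants_hybrid([Candidate(None, None)], pileup_model, full_model))
```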
NanoSNP [73] is a SNP calling method for low-coverage Nanopore sequencing reads, combining long-range haplotype and short-range pileup features for precise SNP identification. The process starts with a pileup model that predicts SNP sites by extracting pileup features from the aligned reads; these SNPs are then used for read phasing. The haplotype model then uses the phased reads to extract, for each haplotype, features such as the nucleotide distribution, base quality and mapping quality. These data form the feature tensor used to generate a pileup image for the candidate SNP site. Moreover, a group of high-quality heterozygous SNP sites adjacent to the candidate SNP site is selected to form the long-range haplotype image.
Training set
Deep learning classification heavily relies on the quality of the training labels. In order to achieve high-performance inference, having a gold-standard dataset is crucial. Because multiple benchmark datasets are available, SNV and Indel variant detection methods based on deep learning often use publicly available datasets for training. However, the training datasets must be selected carefully, as this choice directly impacts the model's performance and robustness. The Genome in a Bottle (GIAB) project [74–76] is a widely used public database that contains samples with high-quality genomic sequences and is often used as a source of training datasets for these methods [77], such as DeepVariant [63], Clairvoyante [68] and CNNScoreVariants [67].
Network architecture
Most deep learning frameworks for genome variant calling primarily rely on the convolutional neural network (CNN) [78, 79] model, while a few utilize the recurrent neural network (RNN) [80] model, and some methods incorporate both networks. The architecture of CNN models for SNV and Indel detection typically includes an input layer, convolutional layer, fully connected layer, softmax layer and output layer. RNN models, on the other hand, consist of an input layer, LSTM layer, fully connected layer, softmax layer and output layer. Some deep learning methods utilize two networks for variant calling. The network architecture is illustrated in Figure 2.

Figure 2. The hierarchical architecture of network layers in deep learning methods. Most deep learning frameworks for genome variant calling are based on the CNN model, while a few are based on the RNN model, and some methods incorporate both networks.
DeepVariant [63] is the first CNN-based approach for detecting genome variants. It adopts the Inception architecture [81–83]. Specifically, an image input layer is first created, which is appended to the ConvNetJuly2015v2 [81] CNN with nine partitions. The final output layer is a softmax layer with three categories, representing the probabilities of the three genotypes, fully connected to the previous layer.
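As a rough approximation of this design, the following Keras sketch attaches a three-way genotype softmax head (homozygous reference, heterozygous, homozygous alternate) to a stock InceptionV3 backbone; the input dimensions, backbone variant and training configuration are illustrative assumptions rather than DeepVariant's exact setup.

```python
import tensorflow as tf

# Illustrative input: a 100 x 221 pileup image with 3 channels.
inputs = tf.keras.Input(shape=(100, 221, 3))
backbone = tf.keras.applications.InceptionV3(
    include_top=False, weights=None, input_tensor=inputs)
x = tf.keras.layers.GlobalAveragePooling2D()(backbone.output)
# Three genotype classes: homozygous reference, heterozygous, homozygous alternate.
outputs = tf.keras.layers.Dense(3, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```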
CNNScoreVariants [67] is a CNN-based deep learning method that uses 1D CNNs on reference tensors and 2D CNNs on read tensors, alternating between the axes of genome site and pileup reads. Max pooling is applied to the pileup axis, not the sequence axis due to the discrete nature of DNA sequence data. After several convolution layers, the spatial dimensions of the tensor are flattened to a single 1D vector that is merged with batch-normalized variant annotations [82], directed into fully connected layers and forwarded to the final softmax layer. Performance is improved by including a skip connection, concatenating normalized annotations with the penultimate dense layer or all deeper layers.
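An illustrative sketch of this two-branch layout is shown below: a 1D convolutional branch for the reference tensor, a 2D convolutional branch for the read tensor with pooling restricted to the read axis, and batch-normalized annotations merged before the softmax, with a skip connection reattaching the annotations to the penultimate layer. All sizes and channel counts are placeholders, not the GATK configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

WINDOW, MAX_READS, N_ANNOT = 128, 64, 7   # illustrative sizes

# Branch 1: 1D convolutions over the one-hot reference tensor.
ref_in = tf.keras.Input(shape=(WINDOW, 4), name="reference")
r = layers.Conv1D(32, 5, padding="same", activation="relu")(ref_in)
r = layers.Flatten()(r)

# Branch 2: 2D convolutions over the read tensor (site x reads x channels).
read_in = tf.keras.Input(shape=(WINDOW, MAX_READS, 9), name="reads")
x = layers.Conv2D(32, (5, 1), padding="same", activation="relu")(read_in)
x = layers.MaxPooling2D(pool_size=(1, 2))(x)    # pool over the read axis only
x = layers.Conv2D(64, (1, 5), padding="same", activation="relu")(x)
x = layers.MaxPooling2D(pool_size=(1, 2))(x)
x = layers.Flatten()(x)

# Variant annotations (e.g. site-level quality summaries), batch-normalized and merged.
annot_in = tf.keras.Input(shape=(N_ANNOT,), name="annotations")
a = layers.BatchNormalization()(annot_in)

merged = layers.concatenate([r, x, a])
h = layers.Dense(64, activation="relu")(merged)
h = layers.concatenate([h, a])                  # skip connection for annotations
out = layers.Dense(3, activation="softmax")(h)

model = tf.keras.Model([ref_in, read_in, annot_in], out)
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```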
Clairvoyante [68] is a multitask five-layer CNN, comprising three convolution layers with varying numbers of kernels and two fully connected layers. Pooling is performed after each convolutional layer. It makes four sets of predictions for each input, with one set calculated from the first fully connected layer and the remaining sets calculated from the second fully connected layer, which are mutually exclusive.
NanoCaller [70] uses two CNNs, one for SNP detection and another for Indel detection. Both models have three convolutional layers with varying kernel sizes, followed by a flatten layer and fully connected layers. The two models differ in how they calculate probabilities: the SNP model has two independent pathways for calculating base probabilities and zygosity probabilities, while the Indel model uses two fully connected hidden layers to determine probabilities for four zygosity scenarios. The final output is obtained by combining the SNP and Indel network calls.
Clair [69] consists of a five-layer RNN with four tasks, including two bi-directional long short-term memory (Bi-LSTM) [84] layers and three fully connected layers. Each Bi-LSTM layer houses 256 cells, and the first fully connected layer carries out transposition and splitting. Dropout is applied to the second Bi-LSTM layer and to the second and third fully connected layers. Clair generates output for these four tasks and possesses an independent penultimate layer (i.e. the third fully connected layer) preceding each task output. This design ensures the independence of each task's output.
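The multitask layout described above can be sketched as follows; the input shape, hidden sizes, dropout rate and the class counts of the four output heads are placeholder assumptions, not Clair's published configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

WINDOW, FEATURES = 33, 8                      # illustrative tensor shape

inputs = tf.keras.Input(shape=(WINDOW, FEATURES))
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(inputs)
x = layers.Bidirectional(layers.LSTM(256))(x)
shared = layers.Dense(256, activation="relu")(x)
shared = layers.Dropout(0.2)(shared)

def task_head(name, n_classes):
    """Each task gets its own penultimate dense layer before its output."""
    h = layers.Dense(128, activation="relu")(shared)
    return layers.Dense(n_classes, activation="softmax", name=name)(h)

# Placeholder task definitions and class counts (not Clair's exact output spaces).
outputs = [task_head("genotype", 21),
           task_head("zygosity", 3),
           task_head("indel_length_1", 33),
           task_head("indel_length_2", 33)]

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```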
Clair3 [72] employs two networks: a pileup network with two different-sized Bi-LSTM layers and a full-alignment network based on the residual network (ResNet) with three standard residual blocks. The pileup network omits the transpose-split layer for enhanced speed. The full-alignment network incorporates a convolutional layer in each block to adjust the channel dimensionality. The spatial pyramid pooling (SPP) layer [85], acting as a pooling layer, creates different receptive fields with three pooling scales per channel, providing a fixed-length output for the subsequent layer.
NanoSNP [73] comprises two networks: the pileup model network and the haplotype model network. The pileup model has two Bi-LSTM layers followed by fully connected layers, using tanh activation to extract sequence features from the input tensor for SNP prediction. Probabilities are computed using fully connected layers with softmax activation. The haplotype model contains a pileup image processing module and a haplotype image processing module, which share the same structure. Both models capture feature correlations at the SNP site, combining their outputs through a fully connected layer for SNP prediction.
Methods for detecting SNVs and Indels in short-read sequencing, such as DeepVariant and CNNScoreVariants, are applicable to WGS and WES data. Long-read sequencing has two major platforms: PacBio and ONT. Clairvoyante, Clair and NanoCaller are able to analyze data generated by both platforms. Although Clairvoyante and Clair can also process Illumina data, they perform better with data from PacBio and ONT. Because ONT sequencing has a higher error rate than PacBio, PEPPER-Margin-DeepVariant is specifically designed for analyzing PacBio data. To address the high error rate of ONT, Clair3 and NanoSNP are specially developed for processing data from the ONT sequencing platform.
SV calling
Compared to SNP and Indel calling, SV calling is a more complex task, due to the variety of types of SVs and their inherent complexity. The following section will introduce several classical deep learning methods in SV detection, which are summarized in Table 2.
Table 2. Summary of deep learning-based methods for SV calling
Model | Variant type | Fragment size | Region | Tensor encoding | Training set | Neural network | Advantage & limitation | Source code | Year |
---|---|---|---|---|---|---|---|---|---|
DeepSV | SV (deletion) | Short read | WGS | Image | Public | CNN | Call long deletions; work with noisy training data | https://github.com/CSuperlei/DeepSV | 2019 |
Samplot-ML | SV (deletion) | Short read | WGS | Image | Public | CNN | Call long deletions | https://github.com/mchowdh200/samplot-ml | 2020 |
TensorSV | SV (deletion, duplication, inversion) | Short read | WGS | Image | Public | CNN | Detect deletions, duplications and inversions; effective genotyping; quick training and inference speeds | https://github.com/timothyjamesbecker/TensorSV | 2020 |
DeepCNV | SV (CNV) | Microarray data and short read | WGS | Image | Synthetic | CNN + DNN | Detect CNVs; enhanced confidence; fewer false positives and failures in replicating associations; rely on the CNV callers to generate raw CNV calls | https://github.com/CAG-CNV/DeepCNV | 2021 |
DeepSVFilter | SV | Short read | WGS | Image | Pretraining | CNN | Filter SVs; employ transfer learning and data augmentation to deal with small datasets; cannot work on WES data; relatively small number of high confidence SVs used to construct the training set | https://github.com/yongzhuang/DeepSVFilter | 2021 |
BreakNet | SV (deletion) | Long read (PacBio) | WGS | Feature matrix | Public | CNN + Bi-LSTM | Detect deletions; stable performance on low coverage data | https://github.com/luojunwei/BreakNet | 2021 |
MAMnet | SV (deletion, insertion) | Long read (PacBio, ONT) | WGS | Feature matrix | Public | CNN + Bi-LSTM | Call insertions and deletions; improved performance on low coverage data | https://github.com/micahvista/MAMnet | 2022 |
svBreak | SV | Short read | WGS | Feature matrix | Synthetic | CNN | Detect 7 common SV breakpoints | https://github.com/BDanalysis/svBreak | 2022 |
DECoNT | SV (CNV) | Short read | WES | Problem formulation | Public | Bi-LSTM | Enhanced call accuracy, reliable germline CNV detection on WES datasets; reliance on existing variation callers | https://github.com/ciceklab/DECoNT | 2022 |
CNV-espresso | SV (CNV) | Short read | WES | Image | Pretraining | CNN | Validation tool; detect rare CNV; cannot detect general CNV; cannot improve sensitivity | https://github.com/ShenLab/CNV-Espresso | 2022 |
SVision | SV (CSV) | Long read (PacBio, ONT) | WGS | Image | Synthetic | CNN | Detect both simple and complex SVs | https://github.com/xjtu-omics/SVision | 2022 |
Cue | SV (CSV) | Short read (extensible) | WGS | Image | Synthetic | Stacked hourglass network | Call and genotype both simple and complex SVs; learn complex SV abstractions directly from data; can be extended to different sequencing platforms | https://github.com/PopicLab/cue | 2023 |
Tensor encoding
DeepSV [86] effectively utilizes a variety of information sources to identify long deletions in sequence data. It employs (R, G, B) image coding of sequences, assigning each nucleotide (A, T, C or G) a basic color that is slightly modified to incorporate deletion signatures. The visualization process combines key features of deletions, including read depth, split reads and discordant pairs. Read depth is depicted through pileup images; split reads and discordant read pairs are integrated by adjusting the base color of each mapped base according to the signature. Moreover, the color coding takes into account pairing status, concordance or discordance, mapping quality and mapping type to enhance the visualization.
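A toy version of such signature-aware color coding is sketched below; the palette, the per-signature shifts and the quality scaling are invented for illustration and do not reproduce DeepSV's actual scheme.

```python
# Illustrative base colors and per-signature adjustments (not DeepSV's actual palette).
BASE_RGB = {"A": (0, 150, 0), "C": (0, 0, 200), "G": (200, 120, 0), "T": (200, 0, 0)}
SIGNATURE_SHIFT = {"concordant": 0, "discordant": 40, "split": 80}

def encode_base(base, signature, mapping_quality):
    """Return an (R, G, B) pixel: the base sets the hue, while the SV signature and
    mapping quality perturb it so that deletion evidence is visible to the CNN."""
    r, g, b = BASE_RGB.get(base, (0, 0, 0))
    shift = SIGNATURE_SHIFT.get(signature, 0)
    scale = min(mapping_quality, 60) / 60.0        # dim low-quality alignments
    return tuple(int(min(c * scale + shift, 255)) for c in (r, g, b))

print(encode_base("A", "split", 50))       # pixel from a split read
print(encode_base("A", "concordant", 50))  # pixel from an ordinary read
```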
Samplot-ML [87] is also a method for identifying long deletions. It uses Samplot [88] to generate images of deletions. Similar to DeepSV, this visualization incorporates read depth, discordant pairs and split read signals.
DeepCNV [89] utilizes image data and metadata for CNV detection, generating image files automatically with PennCNV's auxiliary visualization program [90]. Each CNV call includes a log R ratio (LRR) and a B allele frequency (BAF) scatter plot image [91] showing the potential CNV segment and its adjacent regions. The LRR plot shows SNP genotypes in the region, and the BAF plot covers the same region. DeepCNV differentiates SNPs by color-coding them. Pixel values of these images are normalized to a scale between 0 and 1. Additionally, PennCNV generates a summary of 13 features for quality checking, which are also normalized. This approach significantly improves the reliability of CNV calling, reducing false positives and failures in replicating CNV associations.
CNV-espresso [92] is another CNV detection method, but it targets only rare CNVs. It encodes the read depth signal of each candidate CNV into an image, where the X-axis represents the CNV coordinates in the human genome and the Y-axis represents the normalized read depth value.
DeepSVFilter [93] filters SVs in short-read WGS data by encoding SV signals as images, treating SV filtering as a binary classification problem. It represents each SV breakpoint as a three-dimensional tensor image comprising read depth, split read and discordant read pair channels, with pixels covered by a read encoded as '255' and all others as '0'. If an image covers two breakpoints of a single SV, separate images are generated and then vertically spliced into a complete SV image. After training, DeepSVFilter can filter SV call sets produced by any detection method.
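A minimal sketch of this breakpoint-image encoding is given below, assuming a 224 × 224 window, one read per image row and one channel per evidence type; these choices are illustrative rather than DeepSVFilter's exact layout.

```python
import numpy as np

def breakpoint_image(reads, breakpoint, half_window=112):
    """Encode one SV breakpoint as a (2*half_window, 2*half_window, 3) image.

    `reads` is a list of dicts with 'start', 'end' (0-based genome coordinates)
    and 'type' in {'depth', 'split', 'discordant'}; pixels covered by a read are
    set to 255 in the channel for that evidence type, all others stay 0.
    """
    size = 2 * half_window
    channel = {"depth": 0, "split": 1, "discordant": 2}
    img = np.zeros((size, size, 3), dtype=np.uint8)
    window_start = breakpoint - half_window
    for row, read in enumerate(reads[:size]):          # one read per image row
        lo = max(read["start"] - window_start, 0)
        hi = min(read["end"] - window_start, size)
        if hi > lo:
            img[row, lo:hi, channel[read["type"]]] = 255
    return img

reads = [{"start": 980, "end": 1080, "type": "depth"},
         {"start": 990, "end": 1090, "type": "split"},
         {"start": 950, "end": 1150, "type": "discordant"}]
print(breakpoint_image(reads, breakpoint=1000).shape)  # (224, 224, 3)
```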
BreakNet [94] is designed for identifying deletions within long reads. It divides the reference into various subregions and, based on alignment data, creates a feature matrix for each subregion. In this matrix, each row corresponds to an aligned long read, and each column signifies whether a deletion exists at that specific position. BreakNet organizes the rows in order of deletion count, choosing the top n rows to generate the matrix. In instances where the rows are fewer than n, the remaining elements default to 0. Despite delivering consistent performance on data with low coverage, BreakNet has a limitation. It can only detect deletions due to the constraints of the extracted features.
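The feature-matrix construction can be illustrated with the following sketch, in which each read contributes a 0/1 deletion-flag vector over the sub-region; the row count and input format are simplified assumptions.

```python
import numpy as np

def breaknet_like_matrix(read_deletion_flags, n_rows=50):
    """Build a feature matrix for one sub-region.

    `read_deletion_flags` maps each aligned read to a 0/1 vector indicating
    whether that read shows a deletion at each position of the sub-region.
    Rows are sorted by deletion count; the top `n_rows` are kept and short
    matrices are zero-padded, as described for BreakNet above.
    """
    region_len = len(next(iter(read_deletion_flags.values())))
    rows = sorted(read_deletion_flags.values(), key=lambda v: -sum(v))[:n_rows]
    matrix = np.zeros((n_rows, region_len), dtype=np.float32)
    for i, row in enumerate(rows):
        matrix[i, :] = row
    return matrix

flags = {"read1": [0, 0, 1, 1, 1, 0],
         "read2": [0, 0, 0, 1, 0, 0],
         "read3": [0, 0, 0, 0, 0, 0]}
print(breaknet_like_matrix(flags, n_rows=5))
```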
MAMnet [95] works to detect genome insertions and deletions by comparing long reads to the reference genome. Specifically, the reference genome is split into contiguous subregions, to which the reads are then aligned. The average read coverage is calculated for selected subregions, and the reads that overlap with each subregion are collected. MAMnet proceeds to compute nine features for each base position and averages them. A signature matrix is constructed for every subregion, followed by a logarithmic transformation to stabilize the deep neural network (DNN) training.
svBreak [96] is employed to identify prevalent types of SV breakpoints. It extracts 12 SV-related features for each genome site from the sequencing reads aligned to the reference genome. Each feature is assigned a value of 1, 0 or −1, indicating a positive, negative or uncertain status at that genome site. Following this, svBreak constructs a data matrix in which each row corresponds to a site and each column to a feature. This approach enables the simultaneous calling and distinguishing of seven common SV breakpoints.
SVision [97] is a deep learning–based multi-object recognition framework that detects and identifies complex structural variants in sequencing data, which are often overlooked due to multiple breakpoints. The encoder takes variant feature sequences (VAR) and encodes dissimilarities and similarities between mutation-supporting read and reference genomes (REF) into images with three channels representing matched, duplicated and inverted segments. For each pair of VAR and REF, the encoder identifies matched and unmatched bases to create VAR-to-REF and REF-to-REF images. Variant features are highlighted by eliminating background noise through subtracting the REF-to-REF image from the corresponding VAR-to-REF image, resulting in a denoised image for each variant.
Cue [98] is capable of detecting and genotyping deletions, duplications, inversions, inverted duplications and inversions flanked by deletions that exceed 5 kb in length, with the latter two classified as complex SVs. It learns complex mutation patterns directly from the data, transforming sequences into images that encode SV information signals. In this process, Cue converts read sequences into images that capture multiple sequence signals between two genome intervals. Utilizing a pre-trained neural network, it generates Gaussian response confidence maps for each image, which encode the site, type and genotype of SVs in the image. Subsequently, it refines the high-confidence SV predictions and reassigns them to genome coordinates from the images.
Training set
Training data is vital for machine learning, with quality data leading to better performance. However, SV calling is challenging due to lack of training data and the absence of a universally accepted gold standard, necessitating the creation of SV training sets. This can be achieved through three main methods: using publicly available datasets, manually creating artificial data and using pretraining methodologies.
Publicly accessible datasets are commonly used for training sets, but their scale and heterogeneity may not cover all SV types and variations. Several algorithms like DeepSV [86], DECoNT [99], Samplot-ML [87], TensorSV [100], BreakNet [94] and MAMnet [95] utilize this method. DeepSV, DECoNT and Samplot-ML use the 1000 Genomes Project dataset [1]. Specifically, TensorSV uses three different datasets including the 1000 Genomes Project and the Human Genome SV project datasets [101], BreakNet uses multiple read alignment files from four extensively researched individuals and MAMnet uses six datasets from varying sequencing technologies.
Generating synthetic data is another method, but it may not fully represent real-world SV diversity. Algorithms like SVision [97], svBreak [96], DeepCNV [89] and Cue [98] use this method. SVision trains its CNN model on a mix of real and simulated SVs to ensure a balanced representation of SV types. VISOR [102] supplements training data for inversions, duplications and tandem duplications. svBreak uses simulated SV breakpoints for training. DeepCNV collects a dataset of SVs, including CNVs, from the WGS data generated by other CNV callers [6]. Raw CNV calling files are obtained from the 1000 Genomes FTP site, with false positives filtered out.
Pretraining methods involve initially training a model with a large dataset to learn general SV features, followed by fine-tuning it with a smaller specific dataset with limited SV annotations. This solves the lack of training data and improves accuracy. DeepSVFilter [93] is an example of this method, starting by identifying SVs from selected samples’ short WGS data using advanced SV calling methods [44–46] and then merging all the SVs into a unified set. CNV-espresso [92] is another example, building its training set from offspring–parents trio exome sequencing data.
In summary, building a comprehensive and representative training set for deep learning–based SV analysis requires careful evaluation of the available data and the diversity of SV types. Therefore, a combination of these methods may be necessary to achieve optimal performance in practical applications.
Network architecture
Neural networks for SV detection typically use CNN, RNN (Bi-LSTM) or a combination of both. The CNN effectively handles spatial structure information by extracting local features through convolutional and pooling layers and combining them via fully connected layers. It automatically captures local patterns and global characteristics, benefiting image classification tasks. Bi-LSTM captures temporal dependencies by considering both preceding and succeeding information, which is advantageous for modelling genomic sequence features. Combining the CNN and Bi-LSTM can extract spatial and temporal patterns comprehensively, improving classification performance. The same approach applies to encoding genomic sequence information as feature matrices.
svBreak [96] employs seven CNN models, each serving as a binary classification model, to detect various genomic variations. The network comprises five convolutional layers and three fully connected layers, with the convolutional layers extracting local features from the data. A weighted non-linear mapping is applied to the output matrix by the activation layer using the rectified linear unit (ReLU) function [103], with a 2 × 2 pooling layer for better data processing. Lastly, fully connected layers are used in the output layer to minimize the loss of feature information, ensuring accurate detection of genomic variations.
DECoNT [99] is a comprehensive neural network that predicts precise CNVs and their categories, including deletion, duplication or no call, using a Bi-LSTM structure with 128 hidden neurons in each direction to analyze read depth signals. It includes a batch normalization layer and two fully connected layers, receiving inputs from the Bi-LSTM and previous CNV predictions, represented as one-hot-encoded vectors. The first fully connected layer contains 100 neurons, activated by the ReLU function, while the output layer consists of three neurons using softmax activation for event probability calculation. A weighted cross-entropy loss function optimizes the network for precise CNV prediction, mirroring the process used for categorized CNV prediction.
BreakNet [94] uses CNN and LSTM [104] networks to analyze genomic deletions. The CNN module downsamples the input matrix with an average pooling layer to lessen the computational load and applies six convolution blocks, each with a conv2D layer, an SE optimization layer and a max pooling layer, using the ReLU activation function for non-linearity. The BRNN module has two Bi-LSTM layers with 64 LSTM units each, processing feature vectors from the CNN module and capturing information in both directions. Two fully connected layers categorize the vectors from the BRNN module, with dropout applied after each layer to enhance generalization. The final outputs are computed using the sigmoid function.
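A simplified Keras sketch of this CNN-plus-Bi-LSTM design is shown below; it uses fewer convolution blocks than BreakNet, omits the SE layers and treats the sub-region count, matrix size and layer widths as placeholder assumptions, with the time-distributed application of the CNN across adjacent sub-regions being one plausible reading of the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers

TIME_STEPS, ROWS, COLS = 16, 50, 200      # illustrative: 16 sub-regions per sample

def conv_block(x, filters):
    """One convolution block (the squeeze-and-excitation layer used by BreakNet
    is omitted here for brevity)."""
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.MaxPooling2D(pool_size=2)(x)

# CNN applied to each sub-region matrix to produce one feature vector per step.
cnn_in = tf.keras.Input(shape=(ROWS, COLS, 1))
x = layers.AveragePooling2D(pool_size=2)(cnn_in)        # downsample the input matrix
for filters in (16, 32, 64):
    x = conv_block(x, filters)
cnn = tf.keras.Model(cnn_in, layers.Flatten()(x))

inputs = tf.keras.Input(shape=(TIME_STEPS, ROWS, COLS, 1))
seq = layers.TimeDistributed(cnn)(inputs)
seq = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(seq)
seq = layers.Bidirectional(layers.LSTM(64))(seq)
h = layers.Dense(64, activation="relu")(seq)
h = layers.Dropout(0.3)(h)
out = layers.Dense(1, activation="sigmoid")(h)          # deletion vs. no deletion

model = tf.keras.Model(inputs, out)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```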
MAMnet [95] combines CNN and LSTM models to detect genomic variations, processing the variation feature matrix as a single time step and transforming it into a vector using the CNN. The CNN reuses convolution blocks, each comprising a convolution layer, a max pooling layer, a batch normalization layer and a squeeze-and-excitation (SE) optimization layer [105]. A Bi-LSTM network then captures temporal information across multiple steps in both directions. Finally, three fully connected layers integrate the information to produce the final prediction. The first two fully connected layers are each followed by a dropout layer, while the last fully connected layer uses a sigmoid activation function.
DeepCNV [89] uses a CNN and a DNN to process the image data and metadata, respectively. The CNN branch has a chain of convolutional layers with 3 × 3 receptive fields and fixed 1-pixel-stride filters, using the LeakyReLU [106] activation function and 2 × 2 max pooling. The metadata are modeled through a four-layer DNN. The outputs from the two branches are combined and passed into a 50-neuron fully connected layer, then a terminal sigmoid activation node. This final neuron generates a score indicating false- or true-positive samples. The model is trained with the RMSprop optimizer.
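The two-branch design can be sketched as follows; the image size, filter counts and hidden-layer widths of the metadata DNN are illustrative assumptions, while the LeakyReLU activations, 3 × 3 convolutions, 2 × 2 pooling, 50-neuron merge layer, sigmoid output and RMSprop optimizer follow the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Image branch: stacked 3x3 convolutions with stride 1 and 2x2 max pooling.
img_in = tf.keras.Input(shape=(128, 128, 3), name="cnv_plot")   # illustrative size
x = img_in
for filters in (32, 64, 128):
    x = layers.Conv2D(filters, 3, strides=1, padding="same")(x)
    x = layers.LeakyReLU()(x)
    x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Flatten()(x)

# Metadata branch: a four-layer DNN over the normalized quality-check features.
meta_in = tf.keras.Input(shape=(13,), name="metadata")
m = meta_in
for units in (64, 64, 32, 32):
    m = layers.Dense(units, activation="relu")(m)

# Merge both branches and score the CNV call as true or false positive.
merged = layers.concatenate([x, m])
h = layers.Dense(50, activation="relu")(merged)
out = layers.Dense(1, activation="sigmoid")(h)

model = tf.keras.Model([img_in, meta_in], out)
model.compile(optimizer=tf.keras.optimizers.RMSprop(), loss="binary_crossentropy")
model.summary()
```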
DeepSVFilter [93] applies transfer learning using CNN models pre-trained on the ImageNet dataset for classification [107–112]. Its architecture involves a dropout layer after the pre-trained layers, a fully connected layer and a softmax layer. It employs the Adam optimizer for training on SV images, limits the number of epochs to prevent overfitting and uses a cross-entropy loss function to compare the predicted probabilities with the true category labels. The trained CNN processes candidate SV images and assigns each SV a score ranging from 0 to 1, indicating the likelihood of a true SV.
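A minimal transfer-learning sketch in this spirit is shown below; the MobileNetV2 backbone, image size and hidden-layer width are assumptions (DeepSVFilter offers several ImageNet-pretrained backbones), while the dropout layer, softmax head, Adam optimizer and cross-entropy loss follow the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Backbone choice is an assumption; any ImageNet-pretrained CNN would fit this pattern.
base = tf.keras.applications.MobileNetV2(include_top=False, weights="imagenet",
                                         input_shape=(224, 224, 3))
base.trainable = False                     # keep pre-trained features frozen at first

inputs = tf.keras.Input(shape=(224, 224, 3))
x = base(inputs, training=False)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dropout(0.5)(x)
x = layers.Dense(128, activation="relu")(x)
outputs = layers.Dense(2, activation="softmax")(x)      # true SV vs. false SV

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=10)  # small epoch count limits overfitting
```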
SVision [97] is based on the AlexNet architecture [55] for classifying sequence differences in similarity images, which includes five convolutional layers and three fully connected layers. The first layer processes images with special dimensions. The CNN is trained using transfer learning, initializing parameters with the best parameter set from the ImageNet competition and fine-tuning on the training data. It uses the cross-entropy loss function, and the features extracted during training can also be used for traditional machine learning classification.
Cue [98] builds on the fourth-order stacked hourglass [113, 114] CNN, originally used for human pose estimation, and aims to consolidate information at multiple scales. It starts with a convolutional backbone module, followed by four hourglass modules, each comprising a residual module and max-pooling layers for downscaling, as well as upsampling layers and skip connections to restore the output resolution. After each hourglass module, Cue performs intermediate supervision to generate intermediate confidence map predictions and calculate the loss. This enables iterative re-evaluation of estimates and features at every scale.
The types of SVs that can be detected by different methods vary significantly, for instance, DeepSV is specifically designed for detecting long deletions, while svBreak can identify seven common SV breakpoints. Methods for detecting SVs in short read are generally applicable to WGS data, with only a few suitable for WES data, such as DECoNT and CNV-espresso. Additionally, some methods like Cue can independently perform variant calling and variant type classification. Other methods, like DeepSVFilter and CNV-espresso, require integration with precursor variant calling software, such as Manta [44], Delly [45] or Lumpy [46].
CONCLUSION
The field of genomics is experiencing massive data growth in biomedical and computational biology. Analytical methods have shifted from basic statistics to deep learning algorithms in machine learning, making the process more data-driven. This paper reviews 20 deep learning-based methods for genome variation detection. Specifically, we divide these methods into small variation and structural variation detection, focusing on tensor encoding techniques, training sets and NN models. Deep learning methods represent genetic variations as images, transforming variant calling problems into image classification problems. This approach potentially outperforms traditional methods and represents a promising direction for general SV discovery. These methods, while primarily focusing on germline variant detection, could also be used for somatic variant detection with appropriate training sets.
It's worth noting that each method has advantages and limitations, and users can choose the appropriate method based on the type of variation to be detected and the size of the data fragment. Variation types are mainly divided into small variation and SV. Small variation includes SNV and Indel, with most methods able to detect both SNVs and Indels, whereas NanoSNP is primarily used for identifying SNVs. SV includes duplication, deletion, insertion, inversion and other complex structural variations, among which duplication and deletion are collectively called CNV. For instance, DeepSV focuses on detecting deletions, DeepCNV is used for detecting CNVs and SVision is for detecting CSVs. Data fragment size is categorized into short read and long read. Short-read data should be further distinguished between WGS and WES; for example, DeepSVFilter is used for WGS data and DECoNT for WES data, whereas small variation detection methods such as DeepVariant and CNNScoreVariants do not require this distinction. Long-read data need to be distinguished by whether they were produced by the PacBio or ONT platform; for instance, BreakNet is mainly used for PacBio data, Clair is primarily for ONT data and MAMnet is suitable for data from both platforms. When multiple methods are available for the same variant type and data type, users can choose according to the order in the tables we provide. Typically, methods listed later in a table are improvements on earlier ones; for example, Clair is a successor of Clairvoyante, and NanoCaller is a further improvement of Clair. Additionally, some methods can be used independently, such as DeepVariant and Cue, while others need to be integrated with other tools, like DeepSVFilter, CNV-espresso and Samplot-ML. We recommend that users choose the method that suits the type of variation and data they need to detect, referring to our tables for guidance. If users have sufficient data, they may consider training models with their own data; if data are insufficient, it is advisable to use well-trained models.
Future research may focus on creating scalable tools that can manage complex variants, adapt to new technologies and integrate various sequencing techniques. Current deep learning methods struggle with accurately detecting both short and long reads at the same time, so further work should aim to support multiple sequencing platforms. Additionally, more features in alignment information should be explored to improve variant calling accuracy. Accurately identifying multiple types of SV remains difficult as no single method can precisely identify all types of genomic variations and sequencing data. Therefore, research should aim to develop deep learning methods that can precisely detect different types of SVs, while taking into account the unique characteristics of each sequencing data type.
KEY POINTS
- This article reviews the latest progress of deep learning methods in genome variant calling, including SNV and Indel calling and SV calling.
- We explore tensor encoding methods, which mainly include image-based and multi-dimensional tensor-based methods.
- We analyze the datasets used for variant calling: small variation detection typically uses gold-standard datasets, while structural variation detection datasets often require some form of processing, including synthetic and pretraining methods.
- We study the neural network models, which mainly include three types: CNN-based models, RNN-based models and CNN-RNN hybrid models.
FUNDING
This work was supported by the National Natural Science Foundation of China (Project 62072140) and the Heilongjiang Provincial Science and Technology Department (No. 2022ZXJ03C01).
Author Biographies
Ren Junjun is a PhD student at the Harbin Institute of Technology, School of Computer Science and Technology, Harbin, China.
Zhang Zhengqian is a PhD student at the Harbin Institute of Technology, School of Computer Science and Technology, Harbin, China.
Wu Ying is a Master’s student at the Harbin Institute of Technology, School of Computer Science and Technology, Harbin, China.
Wang Jialiang is a Master’s student at the Harbin Institute of Technology, School of Computer Science and Technology, Harbin, China.
Liu Yongzhuang is an associate professor at the Harbin Institute of Technology, School of Computer Science and Technology, Harbin, China.