A comprehensive review of deep learning-based variant calling methods
Ren Junjun, Zhang Zhengqian, Wu Ying, Wang Jialiang, Liu Yongzhuang
Briefings in Functional Genomics, Volume 23, Issue 4, July 2024, Pages 303–313, https://doi.org/10.1093/bfgp/elae003
Abstract
Genome sequencing data have become increasingly important in the field of personalized medicine and diagnosis. However, accurately detecting genomic variations remains a challenging task. Traditional variation detection methods rely on manual inspection or predefined rules, which can be time-consuming and prone to errors. Consequently, deep learning–based approaches for variation detection have gained attention due to their ability to automatically learn genomic features that distinguish true variants from sequencing artifacts. In this review, we discuss recent advancements in deep learning–based algorithms for detecting small variations and structural variations in genomic data, as well as their advantages and limitations.
INTRODUCTION
Genetic variants can be classified into three main categories based on their size: single-nucleotide variants (SNVs) [1], short insertions and deletions (Indels ≤ 50 bp) and structural variants (SVs > 50 bp) [2], which can include deletion, insertion, duplication, inversion and translocation mutations [3]. Deletions and duplications of genomic fragments are referred to as copy number variations (CNVs) [4, 5]. Moreover, complex structural variants (CSVs) can occur through a combination of simple SV events [6, 7]. Variant calling refers to identifying nucleotide differences in an individual's genome relative to a reference sequence, which is critical for understanding human phenotypic diversity and human diseases such as cancer [8–14]. It plays a vital role in both research and clinical applications of human genome sequencing [15, 16]. Although there have been significant improvements in sequencing technology, accurately detecting genetic variation from billions of short and noisy sequence reads remains a challenging task [17–21].
When detecting SNVs and short Indels, the conventional approach is to identify non-reference bases in the collection of reads covering each position. Probabilistic modeling plays a crucial role in inferring the underlying genotype and in estimating the probability that an observed difference is a true variant rather than a sequencing artifact.
When detecting SVs, there are two primary categories of calling methods. De novo assembly–based methods assemble the original reads into longer sequences and compare them with a reference genome [22–24]. These methods are theoretically capable of detecting all types of variation and are less influenced by the reference sequence. However, accurately assembling a genome sequence can be quite challenging, especially when dealing with heterogeneous sequences. On the other hand, read alignment–based methods detect variations by directly aligning short paired reads or long reads to a reference genome. SV detection from short-read whole genome sequencing (WGS) data typically relies on read depth, discordant read pair and split read signals [3, 25]. However, accurate variant calling in whole exome sequencing (WES) data has been largely limited by technical issues, such as a high error rate [26–29]. Long reads can effectively span regions of high repetition or low complexity, thereby improving alignment quality and enabling the detection of more SVs than short reads [25, 30–35]. The two main single-molecule sequencing platforms, Pacific Biosciences (PacBio) [36] and Oxford Nanopore Technologies (ONT) [37], have significantly enhanced the performance of various genome applications, particularly genome assembly and variant calling [38–43].
Despite the advances in sequencing technology and the significant reduction in sequencing costs, enabling high coverage and a significant decrease in sequencing error rates, accurately detecting all variations in the human genome is still a challenge due to factors such as the complexity of the genome itself [44–47]. To address this, various methods are utilized in large-scale genome studies and integrated into a unified identification set [6, 48]. However, false-positive variant calling remains an issue [25], and heuristic filters and manual examination using software programs are commonly used [49, 50]. Nevertheless, these methods can be time-consuming and difficult to optimize for different sequencing datasets. An effective model-based variation detection method is urgently needed.
Machine learning techniques typically treat variant calling as a classification task, which involves calling and filtering genomic variations, and use supervised learning to develop models that can predict the presence or absence of variants. For instance, forests [51] and SV2 [52] utilize read comparisons to generate features and employ random forest models [53] and support vector machines [54], respectively, to detect SVs.
Deep learning, a machine learning technique that has gained popularity, is being used in diverse fields such as image recognition [55], language translation [56], gaming [57, 58] and life sciences [59–62]. In genomics, deep learning models have shown promise in accurately calling genetic variants, surpassing traditional methods. The introduction of DeepVariant [63], the first deep learning–based variant calling method, marked a shift toward deep learning approaches in contrast to traditional statistical methods. Deep learning methods, led by DeepVariant, have dominated short-read variant calling and have also made progress in long-read variant calling, overcoming challenges posed by high base error rates. Overall, advancements in sequencing technology have greatly enhanced the detection of genetic variations, opening up new possibilities in genomics research and clinical applications.
The present review aims to provide a comprehensive overview of deep learning–based approaches for variant calling, elucidating and comparing them in detail. Deep learning approaches are particularly noteworthy because they aim to reduce expert input and foster the increasing automation of processes. This review is divided into two primary sections, focusing firstly on small variations and secondly on structural variations. It covers key topics including tensor coding, training datasets and neural network (NN) architectures employed in variant calling. Each section of the review comprises specific research examples.
DEEP LEARNING OF GENOME VARIANT CALLING
In this paper, we provide a comprehensive review of variation detection techniques that are based on deep learning models. To better understand these methods, we present a general workflow for variation detection, as illustrated in Figure 1. Some tools are standalone, integrating candidate generation and deep learning in a single pipeline, while others must be combined with upstream variant calling software that performs the data preprocessing.

Figure 1. General workflow of variant calling methods based on deep learning. The upper part shows the input and preprocessing stage; the lower part shows the deep learning stage. First, reads are aligned to the reference genome to obtain candidate variants. The candidate variants are then encoded and fed into a trained neural network to produce high-confidence variant calls.
SNV and Indel calling
In this section, we introduce several classic deep learning methods for SNV and Indel detection. We summarize these methods, as shown in Table 1.
Table 1. Summary of deep learning-based methods for SNV and Indel calling
Model | Variant type | Fragment size | Region | Tensor encoding | Training set | Neural network | Advantage & limitation | Source code | Year |
---|---|---|---|---|---|---|---|---|---|
DeepVariant | SNV, INDEL | Short read | WGS, WES | Pileup image | Public | CNN | High accuracy, robust reliability; considerable computational resources | https://github.com/google/deepvariant/ | 2018 |
CNNScoreVariants | SNV, INDEL | Short read | WGS, WES | Pileup image | Public | CNN | Improved coding technique; enhanced predictive accuracy; coding redundancy | https://github.com/broadinstitute/gatk | 2020 |
Clairvoyante | SNV, INDEL | Long read (PacBio, ONT) | WGS | Pileup image | Public | CNN | Few parameters; outstanding performance in long-read technology; cannot identify multi-allelic variants or Indels ≥4 bases; not considering the base quality | https://github.com/aquaskyline/Clairvoyante | 2019 |
Clair | SNV, INDEL | Long read (PacBio, ONT) | WGS | Pileup image | Public | Bi-LSTM | Considerable improvements in precision, recall and speed; Indel calling of Nanopore data needs to be improved | https://github.com/HKUBAL/Clair | 2020 |
NanoCaller | SNV, INDEL | Long read (PacBio, ONT) | WGS | Pileup image | Public | CNN | Combine long-range haplotype information; few parameters; incorrect alignments in low-complexity regions; potential inaccuracy in detecting Indels in nucleotide repeats of Nanopore data | https://github.com/WGLab/NanoCaller | 2021 |
PEPPER-Margin-DeepVariant | SNV, INDEL | Long read (PacBio) | WGS | Full-alignment image | Public | CNN+ RNN | Haplotype-aware; encode more features; advanced variant calling results on Nanopore data; low Indel calling accuracy on Nanopore data | PEPPER: https://github.com/kishwarshafin/pepper Margin: https://github.com/UCSC-nanopore-cgl/margin DeepVariant: https://github.com/google/deepvariant | 2021 |
Clair3 | SNV, INDEL | Long read (ONT) | WGS | Full-alignment image and pileup image | Public | CNN + Bi-LSTM | Combine pileup-based and full alignment variant calling; fast runtime and excellent performance | https://github.com/HKUBAL/Clair3 | 2022 |
NanoSNP | SNV | Long read (ONT) | WGS | Pileup image and haplotype image | Public | Bi-LSTM | Combine long-range haplotype feature and short-range pileup feature; best performance on Nanopore data; cannot identify SNPs with short reads | https://github.com/huangnengCSU/NanoSNP.git | 2023 |
Tensor encoding
Variant calling methods such as DeepVariant [63] treat the task as an image classification problem via tensor encoding. Sequencing data are turned into an image and analyzed to identify genetic variations. DeepVariant learns the relationship between read pileup images and true genotype calls for accurate variant identification. The process starts with identifying candidate SNPs and Indels, processing mapped reads with a local read assembly procedure based on a De Bruijn graph [64] and selecting the best haplotypes with a hidden Markov model (HMM) [65]. A Smith–Waterman-like algorithm [66] is used for read realignment, and only high-quality reads are considered for variant calling. Candidate sites reaching the threshold are encoded as three-channel RGB pileup images, with the first row representing the reference sequence and the remaining rows representing reads, resulting in one image per candidate site.
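To make the pileup-image idea concrete, the sketch below encodes a small window of aligned reads as an image-like array, with the reference in the first row and one read per subsequent row. The channel layout, color values, window size and read cap are illustrative simplifications, not DeepVariant's actual encoding.

```python
import numpy as np

# Hypothetical color map: one base value per nucleotide (values are illustrative).
BASE_COLOR = {"A": 250, "C": 30, "G": 180, "T": 100, "N": 0}

def encode_pileup(reference, reads, window=15, max_reads=20):
    """Encode a candidate site as a (max_reads + 1) x window x 3 uint8 image.

    reference : str of length `window`, centered on the candidate site
    reads     : list of (sequence, base_qualities, strand) tuples, each aligned
                to the same window ('-' marks positions a read does not cover)
    """
    img = np.zeros((max_reads + 1, window, 3), dtype=np.uint8)

    # Row 0: the reference sequence, encoded in the first channel only.
    for col, base in enumerate(reference):
        img[0, col, 0] = BASE_COLOR.get(base, 0)

    # Remaining rows: one read per row.
    for row, (seq, quals, strand) in enumerate(reads[:max_reads], start=1):
        for col, base in enumerate(seq):
            if base == "-":
                continue  # position not covered by this read
            img[row, col, 0] = BASE_COLOR.get(base, 0)      # base identity
            img[row, col, 1] = min(quals[col], 60) * 4      # base quality
            img[row, col, 2] = 255 if strand == "+" else 70 # strand
    return img

# Toy example: one candidate site covered by two reads, one carrying a mismatch.
ref = "ACGTACGTACGTACG"
reads = [("ACGTACGTACGTACG", [30] * 15, "+"),
         ("ACGTACCTACGTACG", [25] * 15, "-")]
print(encode_pileup(ref, reads).shape)  # (21, 15, 3)
```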
CNNScoreVariants [67] retains both read and reference sequence information, adding mapping quality and read flags. Unlike DeepVariant, it encodes base qualities into the read tensor's base channels, treating them as a confidence level for each base call. It uses one-hot and p-hot encoding for the reference and read tensors, respectively. The reference tensor is a two-dimensional (2D) encoding of reference bases centered on the variant, and the read tensor is a three-dimensional tensor spanning different genomic sites in width and different read pileups in height. The reference tensor's channels are the four DNA bases, while the read tensor's first four channels encode base quality and the remaining five channels encode read flags representing the strand, pairing and mapping quality.
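The contrast between one-hot reference encoding and quality-aware "p-hot" read encoding can be illustrated as follows; this is a simplified interpretation in which each base call's Phred quality is converted to an error probability that is spread over the non-called bases, and it does not reproduce the exact GATK CNNScoreVariants implementation.

```python
import numpy as np

BASES = "ACGT"

def one_hot_reference(ref_seq):
    """One-hot encode a reference window: shape (len(ref_seq), 4)."""
    out = np.zeros((len(ref_seq), 4), dtype=np.float32)
    for i, b in enumerate(ref_seq):
        if b in BASES:
            out[i, BASES.index(b)] = 1.0
    return out

def p_hot_read(read_seq, phred_quals):
    """'p-hot' encode a read: the called base gets probability 1 - error,
    and the error probability is spread evenly over the other three bases."""
    out = np.zeros((len(read_seq), 4), dtype=np.float32)
    for i, (b, q) in enumerate(zip(read_seq, phred_quals)):
        if b not in BASES:
            continue
        err = 10 ** (-q / 10.0)             # Phred quality -> error probability
        out[i, :] = err / 3.0               # share error mass among other bases
        out[i, BASES.index(b)] = 1.0 - err  # confidence in the called base
    return out

print(one_hot_reference("ACGT"))
print(p_hot_read("ACGT", [30, 10, 40, 20]))
```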
Clairvoyante [68], specifically designed for SNP and Indel calls in single-molecule sequencing data, uses a three-dimensional tensor to encode information about the read and reference sequence. The tensor’s dimensions represent the site, count of the four bases on the reads and four different counting methods. In the third dimension, four different counting methods are used to generate four tensors: (1) for the reference sequence and supporting reads, (2) for inserted sequences, (3) for deleted base pairs and (4) for alternative alleles.
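A minimal sketch of a position × base × counting-method count tensor in the spirit of Clairvoyante is given below; the window size, the simplified per-column input format and the ordering of the four counting planes are assumptions made for illustration.

```python
import numpy as np

BASES = "ACGT"

def clairvoyante_like_tensor(columns, window=33):
    """Build a (window, 4, 4) count tensor for one candidate site.

    `columns` is a list of length `window`; each element is a dict with keys
    'ref_support', 'insertion', 'deletion', 'alt' mapping to lists of observed
    bases in that pileup column (a simplified stand-in for real alignments).
    """
    tensor = np.zeros((window, 4, 4), dtype=np.float32)
    plane = {"ref_support": 0, "insertion": 1, "deletion": 2, "alt": 3}
    for pos, col in enumerate(columns):
        for kind, bases in col.items():
            for b in bases:
                if b in BASES:
                    tensor[pos, BASES.index(b), plane[kind]] += 1.0
    return tensor

# Toy column: 8 reads support the reference 'A', 2 reads show an alternative 'G'.
cols = [{"ref_support": ["A"] * 8, "insertion": [], "deletion": [], "alt": ["G", "G"]}
        for _ in range(33)]
print(clairvoyante_like_tensor(cols).shape)  # (33, 4, 4)
```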
Clair [69] is an improved version of Clairvoyante that introduces four new deep learning tasks to address Clairvoyante's limitations, including multi-allelic variant calling and long Indel calling. Clair employs an encoding method similar to Clairvoyante's, but the second dimension of its three-dimensional tensor is twice as large, representing positive- and negative-strand counts for the four bases.
NanoCaller [70] is designed for long-read sequencing data, using long-range haplotype information and phased reads to improve variant calling accuracy. Unlike other tools such as DeepVariant, Clairvoyante and Clair, it only considers heterozygous SNP sites far from the candidate site in the pileup images. The read and reference sequence information of SNP candidate sites is encoded into a three-dimensional tensor covering alternative alleles, candidate sites and reference sequence bases. The pileup image represents different bases at the candidate site using five channels, with the fifth channel representing the reference sequence. For Indel candidate sites, the pileup image is also encoded as a three-dimensional tensor, combining matrices of all reads, reads in one phase and reads in the other phase at the candidate site. The three dimensions denote bases or deletions, pileup columns of the realigned sequences and the two matrices.
PEPPER-Margin-DeepVariant [71] and NanoCaller both use a haplotype-aware strategy for variant calling in long-read sequencing data. PEPPER-SNP, a submodel of PEPPER, calls SNPs using tensor encoding and represents each genomic site with 10 features. Different bases are color-coded, with each row and column representing a feature and reference genome site, respectively. Observations are coded as weights, shown as the alpha of each base. Another submodule, PEPPER-HP, considers SNVs and Indels as candidate variants and generates haplotype-specific likelihoods for each candidate variant. Similar to PEPPER-SNP, PEPPER-HP uses an encoding in which each column represents a reference position with two values, indicating the reference sequence site and the insertion alleles targeted at that site.
Clairvoyante, Clair and NanoCaller are based on pileup, which has an advantage in terms of time efficiency. PEPPER-Margin-DeepVariant is based on full-alignment variant calling, which provides the highest precision and recall. Clair3 [72], as the successor of Clair, combines the advantages of both methods by using full alignment for difficult variant candidates and pileup calling for the majority of candidates, resulting in fast runtime and excellent performance. Clair3’s input includes a 2D tensor for pileup and a three-dimensional tensor for full alignment, encoding genome site, features and various information related to the variant calling process.
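The division of labor between the pileup and full-alignment callers can be summarized by the sketch below. The `Candidate`/`Call` structures, the model callables and the quality threshold are placeholders for illustration, not Clair3's actual interface.

```python
from collections import namedtuple

Candidate = namedtuple("Candidate", "pileup_tensor full_alignment_tensor")
Call = namedtuple("Call", "genotype quality")

def call_variants_hybrid(candidates, pileup_model, full_alignment_model,
                         quality_threshold=0.95):
    """Two-stage calling in the spirit of Clair3: fast pileup calls first,
    full-alignment re-calling only for low-confidence candidates."""
    final_calls = []
    for site in candidates:
        call = pileup_model(site.pileup_tensor)              # fast path
        if call.quality < quality_threshold:                 # difficult candidate
            call = full_alignment_model(site.full_alignment_tensor)
        final_calls.append(call)
    return final_calls

# Toy models: the pileup model is unsure, the full-alignment model is confident.
pileup_model = lambda t: Call("0/1", 0.60)
full_model = lambda t: Call("1/1", 0.99)
print(call_variants_hybrid([Candidate(None, None)], pileup_model, full_model))
```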
NanoSNP [73] is a SNP calling method for low-coverage Nanopore sequencing reads, combining long-range haplotype and short-range pileup features for precise SNP identification. The process starts with a pileup model that predicts SNP sites by extracting pileup features from the aligned reads; these SNPs are then used for read phasing. The haplotype model then uses the phased reads to extract, for each haplotype, features such as the nucleotide distribution, base quality and mapping quality. These data form the feature tensor used to generate a pileup image for the candidate SNP site. Moreover, a group of high-quality heterozygous SNP sites adjacent to the candidate SNP site is selected to form the long-range haplotype image.
Training set
Deep learning classification heavily relies on the quality of the training labels. In order to achieve high-performance inference, having a gold-standard dataset is crucial. Because multiple benchmark datasets are available, SNV and Indel variant detection methods based on deep learning often use publicly available datasets for training. However, the training datasets must be selected carefully, as this choice directly impacts the model's performance and robustness. The Genome in a Bottle (GIAB) project [74–76] is a widely used public database that contains samples with high-quality genomic sequences and is often used as a source of training datasets for these methods [77], such as DeepVariant [63], Clairvoyante [68] and CNNScoreVariants [67].
Network architecture
Most deep learning frameworks for genome variant calling primarily rely on the convolutional neural network (CNN) [78, 79] model, while a few utilize the recurrent neural network (RNN) [80] model, and some methods incorporate both networks. The architecture of CNN models for SNV and Indel detection typically includes an input layer, convolutional layer, fully connected layer, softmax layer and output layer. RNN models, on the other hand, consist of an input layer, LSTM layer, fully connected layer, softmax layer and output layer. Some deep learning methods utilize two networks for variant calling. The network architecture is illustrated in Figure 2.

Figure 2. The hierarchical architecture of network layers in deep learning methods. Most deep learning frameworks for genome variant calling are based on the CNN model, while a few are based on the RNN model, and some methods incorporate both networks.
DeepVariant [63] is the first CNN-based approach for detecting genome variants. It adopts the Inception architecture [81–83]. Specifically, an image input layer is first created, which is appended to the ConvNetJuly2015v2 [81] CNN with nine partitions. The final output layer is a softmax layer with three categories, representing the probabilities of the three genotypes, fully connected to the previous layer.
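As a rough approximation of this design, the following Keras sketch attaches a three-way genotype softmax head (homozygous reference, heterozygous, homozygous alternate) to a stock InceptionV3 backbone; the input dimensions, backbone variant and training configuration are illustrative assumptions rather than DeepVariant's exact setup.

```python
import tensorflow as tf

# Illustrative input: a 100 x 221 pileup image with 3 channels.
inputs = tf.keras.Input(shape=(100, 221, 3))
backbone = tf.keras.applications.InceptionV3(
    include_top=False, weights=None, input_tensor=inputs)
x = tf.keras.layers.GlobalAveragePooling2D()(backbone.output)
# Three genotype classes: homozygous reference, heterozygous, homozygous alternate.
outputs = tf.keras.layers.Dense(3, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```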
CNNScoreVariants [67] is a CNN-based deep learning method that uses 1D CNNs on reference tensors and 2D CNNs on read tensors, alternating between the axes of genome site and pileup reads. Max pooling is applied to the pileup axis, not the sequence axis due to the discrete nature of DNA sequence data. After several convolution layers, the spatial dimensions of the tensor are flattened to a single 1D vector that is merged with batch-normalized variant annotations [82], directed into fully connected layers and forwarded to the final softmax layer. Performance is improved by including a skip connection, concatenating normalized annotations with the penultimate dense layer or all deeper layers.
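An illustrative sketch of this two-branch layout is shown below: a 1D convolutional branch for the reference tensor, a 2D convolutional branch for the read tensor with pooling restricted to the read axis, and batch-normalized annotations merged before the softmax, with a skip connection reattaching the annotations to the penultimate layer. All sizes and channel counts are placeholders, not the GATK configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

WINDOW, MAX_READS, N_ANNOT = 128, 64, 7   # illustrative sizes

# Branch 1: 1D convolutions over the one-hot reference tensor.
ref_in = tf.keras.Input(shape=(WINDOW, 4), name="reference")
r = layers.Conv1D(32, 5, padding="same", activation="relu")(ref_in)
r = layers.Flatten()(r)

# Branch 2: 2D convolutions over the read tensor (site x reads x channels).
read_in = tf.keras.Input(shape=(WINDOW, MAX_READS, 9), name="reads")
x = layers.Conv2D(32, (5, 1), padding="same", activation="relu")(read_in)
x = layers.MaxPooling2D(pool_size=(1, 2))(x)    # pool over the read axis only
x = layers.Conv2D(64, (1, 5), padding="same", activation="relu")(x)
x = layers.MaxPooling2D(pool_size=(1, 2))(x)
x = layers.Flatten()(x)

# Variant annotations (e.g. site-level quality summaries), batch-normalized and merged.
annot_in = tf.keras.Input(shape=(N_ANNOT,), name="annotations")
a = layers.BatchNormalization()(annot_in)

merged = layers.concatenate([r, x, a])
h = layers.Dense(64, activation="relu")(merged)
h = layers.concatenate([h, a])                  # skip connection for annotations
out = layers.Dense(3, activation="softmax")(h)

model = tf.keras.Model([ref_in, read_in, annot_in], out)
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```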
Clairvoyante [68] is a multitask five-layer CNN, comprising three convolution layers with varying numbers of kernels and two fully connected layers. Pooling is performed after each convolutional layer. It makes four sets of predictions for each input, with one set calculated from the first fully connected layer and the remaining sets calculated from the second fully connected layer, which are mutually exclusive.
NanoCaller [70] uses two CNNs, one for SNP detection and another for Indel detection. Both models have three convolutional layers with varying kernel sizes, followed by a flatten layer and fully connected layers. The two models differ in how they calculate probabilities: the SNP model has two independent pathways for calculating base probabilities and zygosity probabilities, while the Indel model uses two fully connected hidden layers to determine probabilities for four zygosity scenarios. The final output is obtained by combining the SNP and Indel network calls.
Clair [69] consists of a five-layer RNN with four tasks, including two bi-directional long short-term memory (Bi-LSTM) [84] layers and three fully connected layers. Each Bi-LSTM layer houses 256 cells, and the first fully connected layer carries out transposition and splitting. Dropout is applied to the second Bi-LSTM layer and to the second and third fully connected layers. Clair generates output for these four tasks and possesses an independent penultimate layer (i.e. the third fully connected layer) preceding each task output. This design ensures the independence of each task's output.
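The multitask layout described above can be sketched as follows; the input shape, hidden sizes, dropout rate and the class counts of the four output heads are placeholder assumptions, not Clair's published configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

WINDOW, FEATURES = 33, 8                      # illustrative tensor shape

inputs = tf.keras.Input(shape=(WINDOW, FEATURES))
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(inputs)
x = layers.Bidirectional(layers.LSTM(256))(x)
shared = layers.Dense(256, activation="relu")(x)
shared = layers.Dropout(0.2)(shared)

def task_head(name, n_classes):
    """Each task gets its own penultimate dense layer before its output."""
    h = layers.Dense(128, activation="relu")(shared)
    return layers.Dense(n_classes, activation="softmax", name=name)(h)

# Placeholder task definitions and class counts (not Clair's exact output spaces).
outputs = [task_head("genotype", 21),
           task_head("zygosity", 3),
           task_head("indel_length_1", 33),
           task_head("indel_length_2", 33)]

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```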
Clair3 [72] employs two networks: a pileup network with two different-sized Bi-LSTM layers and a full-alignment network based on the residual network (ResNet) with three standard residual blocks. The pileup network omits the transpose-split layer for enhanced speed. The full-alignment network incorporates a convolutional layer in each block to adjust the channel dimensionality. The spatial pyramid pooling (SPP) layer [85], acting as a pooling layer, creates different receptive fields with three pooling scales per channel, providing a fixed-length output for the subsequent layer.
NanoSNP [73] comprises two networks: the pileup model network and the haplotype model network. The pileup model has two Bi-LSTM layers followed by fully connected layers, using tanh activation to extract sequence features from the input tensor for SNP prediction. Probabilities are computed using fully connected layers with softmax activation. The haplotype model contains a pileup image processing module and a haplotype image processing module, which share the same structure. Both models capture feature correlations at the SNP site, combining their outputs through a fully connected layer for SNP prediction.
Methods for detecting SNVs and Indels in short-read sequencing, such as DeepVariant and CNNScoreVariants, are applicable to WGS and WES data. Long-read sequencing has two major platforms: PacBio and ONT. Clairvoyante, Clair and NanoCaller are able to analyze data generated by both platforms. Although Clairvoyante and Clair can also process Illumina data, they perform better with data from PacBio and ONT. Because ONT sequencing has a higher error rate than PacBio, PEPPER-Margin-DeepVariant is specifically designed for analyzing PacBio data. To address the high error rate of ONT, Clair3 and NanoSNP are specially developed for processing data from the ONT sequencing platform.
SV calling
Compared to SNP and Indel calling, SV calling is a more complex task, due to the variety of types of SVs and their inherent complexity. The following section will introduce several classical deep learning methods in SV detection, which are summarized in Table 2.
Table 2. Summary of deep learning-based methods for SV calling
Model | Variant type | Fragment size | Region | Tensor encoding | Training set | Neural network | Advantage & limitation | Source code | Year |
---|---|---|---|---|---|---|---|---|---|
DeepSV | SV (deletion) | Short read | WGS | Image | Public | CNN | Call long deletions; work with noisy training data | https://github.com/CSuperlei/DeepSV | 2019 |
Samplot-ML | SV (deletion) | Short read | WGS | Image | Public | CNN | Call long deletions | https://github.com/mchowdh200/samplot-ml | 2020 |
TensorSV | SV (deletion, duplication, inversion) | Short read | WGS | Image | Public | CNN | Detect deletions, duplications and inversions; effective genotyping; quick training and inference speeds | https://github.com/timothyjamesbecker/TensorSV | 2020 |
DeepCNV | SV (CNV) | Microarray data and short read | WGS | Image | Synthetic | CNN + DNN | Detect CNVs; enhanced confidence; fewer false positives and failures in replicating associations; rely on the CNV callers to generate raw CNV calls | https://github.com/CAG-CNV/DeepCNV | 2021 |
DeepSVFilter | SV | Short read | WGS | Image | Pretraining | CNN | Filter SVs; employ transfer learning and data augmentation to deal with small datasets; cannot work on WES data; relatively small number of high confidence SVs used to construct the training set | https://github.com/yongzhuang/DeepSVFilter | 2021 |
BreakNet | SV (deletion) | Long read (PacBio) | WGS | Feature matrix | Public | CNN + Bi-LSTM | Detect deletions; stable performance on low coverage data | https://github.com/luojunwei/BreakNet | 2021 |
MAMnet | SV (deletion, insertion) | Long read (PacBio, ONT) | WGS | Feature matrix | Public | CNN + Bi-LSTM | Call insertions and deletions; improved performance on low coverage data | https://github.com/micahvista/MAMnet | 2022 |
svBreak | SV | Short read | WGS | Feature matrix | Synthetic | CNN | Detect 7 common SV breakpoints | https://github.com/BDanalysis/svBreak | 2022 |
DECoNT | SV (CNV) | Short read | WES | Problem formulation | Public | Bi-LSTM | Enhanced call accuracy, reliable germline CNV detection on WES datasets; reliance on existing variation callers | https://github.com/ciceklab/DECoNT | 2022 |
CNV-espresso | SV (CNV) | Short read | WES | Image | Pretraining | CNN | Validation tool; detect rare CNV; cannot detect general CNV; cannot improve sensitivity | https://github.com/ShenLab/CNV-Espresso | 2022 |
SVision | SV (CSV) | Long read (PacBio, ONT) | WGS | Image | Synthetic | CNN | Detect both simple and complex SVs | https://github.com/xjtu-omics/SVision | 2022 |
Cue | SV (CSV) | Short read (extensible) | WGS | Image | Synthetic | Stacked hourglass network | Call and genotype both simple and complex SVs; learn complex SV abstractions directly from data; can be extended to different sequencing platforms | https://github.com/PopicLab/cue | 2023 |
Tensor encoding
DeepSV [86] effectively utilizes a variety of information sources to identify long deletions in sequence data. It employs (R, G, B) image coding of sequences, assigning each nucleotide (A, T, C or G) a basic color that is slightly modified to incorporate deletion signatures. The visualization process combines key features of deletions, including read depth, split reads and discordant pairs. Read depth is depicted through pileup images; split reads and discordant read pairs are integrated by adjusting the base color of each mapped base according to the signature. Moreover, the color coding takes into account pairing status, concordance or discordance, mapping quality and mapping type to enhance the visualization.
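A toy version of such signature-aware color coding is sketched below; the palette, the per-signature shifts and the quality scaling are invented for illustration and do not reproduce DeepSV's actual scheme.

```python
# Illustrative base colors and per-signature adjustments (not DeepSV's actual palette).
BASE_RGB = {"A": (0, 150, 0), "C": (0, 0, 200), "G": (200, 120, 0), "T": (200, 0, 0)}
SIGNATURE_SHIFT = {"concordant": 0, "discordant": 40, "split": 80}

def encode_base(base, signature, mapping_quality):
    """Return an (R, G, B) pixel: the base sets the hue, while the SV signature and
    mapping quality perturb it so that deletion evidence is visible to the CNN."""
    r, g, b = BASE_RGB.get(base, (0, 0, 0))
    shift = SIGNATURE_SHIFT.get(signature, 0)
    scale = min(mapping_quality, 60) / 60.0        # dim low-quality alignments
    return tuple(int(min(c * scale + shift, 255)) for c in (r, g, b))

print(encode_base("A", "split", 50))       # pixel from a split read
print(encode_base("A", "concordant", 50))  # pixel from an ordinary read
```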
Samplot-ML [87] is also a method for identifying long deletions. It uses Samplot [88] to generate images of deletions. Similar to DeepSV, this visualization incorporates read depth, discordant pairs and split read signals.
DeepCNV [89] utilizes image data and metadata for CNV detection, generating image files automatically with PennCNV's auxiliary visualization program [90]. Each CNV call includes a log R ratio (LRR) and a B allele frequency (BAF) scatter plot image [91] showing the potential CNV segment and its adjacent regions. The LRR plot shows SNP genotypes in the region, and the BAF plot covers the same region. DeepCNV differentiates SNPs by color-coding them. Pixel values of these images are normalized to a scale between 0 and 1. Additionally, PennCNV generates a summary of 13 features for quality checking, which are also normalized. This approach significantly improves the reliability of CNV calling, reducing false positives and failures in replicating CNV associations.
CNV-espresso [92] is another CNV detection method, but it targets only rare CNVs. It encodes the read depth signal of each candidate CNV into an image, where the X-axis represents the CNV coordinates in the human genome and the Y-axis represents the normalized read depth value.
DeepSVFilter [93] filters SVs in short-read WGS data by encoding SV signals as images, treating SV filtering as a binary classification problem. It represents each SV breakpoint as a three-dimensional tensor image comprising read depth, split read and discordant read pair channels, with pixels covered by a read encoded as '255' and all others as '0'. If an image covers two breakpoints of a single SV, separate images are generated and then vertically spliced into a complete SV image. After training, DeepSVFilter can filter SV call sets produced by any detection method.
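A minimal sketch of this breakpoint-image encoding is given below, assuming a 224 × 224 window, one read per image row and one channel per evidence type; these choices are illustrative rather than DeepSVFilter's exact layout.

```python
import numpy as np

def breakpoint_image(reads, breakpoint, half_window=112):
    """Encode one SV breakpoint as a (2*half_window, 2*half_window, 3) image.

    `reads` is a list of dicts with 'start', 'end' (0-based genome coordinates)
    and 'type' in {'depth', 'split', 'discordant'}; pixels covered by a read are
    set to 255 in the channel for that evidence type, all others stay 0.
    """
    size = 2 * half_window
    channel = {"depth": 0, "split": 1, "discordant": 2}
    img = np.zeros((size, size, 3), dtype=np.uint8)
    window_start = breakpoint - half_window
    for row, read in enumerate(reads[:size]):          # one read per image row
        lo = max(read["start"] - window_start, 0)
        hi = min(read["end"] - window_start, size)
        if hi > lo:
            img[row, lo:hi, channel[read["type"]]] = 255
    return img

reads = [{"start": 980, "end": 1080, "type": "depth"},
         {"start": 990, "end": 1090, "type": "split"},
         {"start": 950, "end": 1150, "type": "discordant"}]
print(breakpoint_image(reads, breakpoint=1000).shape)  # (224, 224, 3)
```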
BreakNet [94] is designed for identifying deletions within long reads. It divides the reference into various subregions and, based on alignment data, creates a feature matrix for each subregion. In this matrix, each row corresponds to an aligned long read, and each column signifies whether a deletion exists at that specific position. BreakNet organizes the rows in order of deletion count, choosing the top n rows to generate the matrix. In instances where the rows are fewer than n, the remaining elements default to 0. Despite delivering consistent performance on data with low coverage, BreakNet has a limitation. It can only detect deletions due to the constraints of the extracted features.
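The feature-matrix construction can be illustrated with the following sketch, in which each read contributes a 0/1 deletion-flag vector over the sub-region; the row count and input format are simplified assumptions.

```python
import numpy as np

def breaknet_like_matrix(read_deletion_flags, n_rows=50):
    """Build a feature matrix for one sub-region.

    `read_deletion_flags` maps each aligned read to a 0/1 vector indicating
    whether that read shows a deletion at each position of the sub-region.
    Rows are sorted by deletion count; the top `n_rows` are kept and short
    matrices are zero-padded, as described for BreakNet above.
    """
    region_len = len(next(iter(read_deletion_flags.values())))
    rows = sorted(read_deletion_flags.values(), key=lambda v: -sum(v))[:n_rows]
    matrix = np.zeros((n_rows, region_len), dtype=np.float32)
    for i, row in enumerate(rows):
        matrix[i, :] = row
    return matrix

flags = {"read1": [0, 0, 1, 1, 1, 0],
         "read2": [0, 0, 0, 1, 0, 0],
         "read3": [0, 0, 0, 0, 0, 0]}
print(breaknet_like_matrix(flags, n_rows=5))
```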
MAMnet [95] works to detect genome insertions and deletions by comparing long reads to the reference genome. Specifically, the reference genome is split into contiguous subregions, to which the reads are then aligned. The average read coverage is calculated for selected subregions, and the reads that overlap with each subregion are collected. MAMnet proceeds to compute nine features for each base position and averages them. A signature matrix is constructed for every subregion, followed by a logarithmic transformation to stabilize the deep neural network (DNN) training.
svBreak [96] is employed to identify prevalent types of SV breakpoints. It extracts 12 SV-related features for each genome site from the sequencing reads aligned to the reference genome. Each feature is assigned a value of 1, 0 or −1, indicating a positive, negative or uncertain status at that genome site. Following this, svBreak constructs a data matrix in which each row corresponds to a site and each column to a feature. This approach enables the simultaneous calling and distinguishing of seven common SV breakpoints.
SVision [97] is a deep learning–based multi-object recognition framework that detects and identifies complex structural variants in sequencing data, which are often overlooked due to multiple breakpoints. The encoder takes variant feature sequences (VAR) and encodes dissimilarities and similarities between mutation-supporting read and reference genomes (REF) into images with three channels representing matched, duplicated and inverted segments. For each pair of VAR and REF, the encoder identifies matched and unmatched bases to create VAR-to-REF and REF-to-REF images. Variant features are highlighted by eliminating background noise through subtracting the REF-to-REF image from the corresponding VAR-to-REF image, resulting in a denoised image for each variant.
Cue [98] is capable of detecting and genotyping deletions, duplications, inversions, inverted duplications and inversions flanked by deletions that exceed 5 kb in length, with the latter two classified as complex SVs. It learns complex mutation patterns directly from the data, transforming sequences into images that encode SV information signals. In this process, Cue converts read sequences into images that capture multiple sequence signals between two genome intervals. Utilizing a pre-trained neural network, it generates Gaussian response confidence maps for each image, which encode the site, type and genotype of SVs in the image. Subsequently, it refines the high-confidence SV predictions and reassigns them to genome coordinates from the images.
Training set
Training data is vital for machine learning, with quality data leading to better performance. However, SV calling is challenging due to lack of training data and the absence of a universally accepted gold standard, necessitating the creation of SV training sets. This can be achieved through three main methods: using publicly available datasets, manually creating artificial data and using pretraining methodologies.
Publicly accessible datasets are commonly used for training sets, but their scale and heterogeneity may not cover all SV types and variations. Several algorithms like DeepSV [86], DECoNT [99], Samplot-ML [87], TensorSV [100], BreakNet [94] and MAMnet [95] utilize this method. DeepSV, DECoNT and Samplot-ML use the 1000 Genomes Project dataset [1]. Specifically, TensorSV uses three different datasets including the 1000 Genomes Project and the Human Genome SV project datasets [101], BreakNet uses multiple read alignment files from four extensively researched individuals and MAMnet uses six datasets from varying sequencing technologies.
Generating synthetic data is another method, but it may not fully represent real-world SV diversity. Algorithms like SVision [97], svBreak [96], DeepCNV [89] and Cue [98] use this method. SVision trains its CNN model on a mix of real and simulated SVs to ensure a balanced representation of SV types. VISOR [102] supplements training data for inversions, duplications and tandem duplications. svBreak uses simulated SV breakpoints for training. DeepCNV collects a dataset of SVs, including CNVs, from the WGS data generated by other CNV callers [6]. Raw CNV calling files are obtained from the 1000 Genomes FTP site, with false positives filtered out.
Pretraining methods involve initially training a model with a large dataset to learn general SV features, followed by fine-tuning it with a smaller specific dataset with limited SV annotations. This solves the lack of training data and improves accuracy. DeepSVFilter [93] is an example of this method, starting by identifying SVs from selected samples’ short WGS data using advanced SV calling methods [44–46] and then merging all the SVs into a unified set. CNV-espresso [92] is another example, building its training set from offspring–parents trio exome sequencing data.
In summary, building a comprehensive and representative training set for deep learning–based SV analysis requires careful evaluation of the available data and the diversity of SV types. Therefore, a combination of these methods may be necessary to achieve optimal performance in practical applications.
Network architecture
Neural networks for SV detection typically use CNN, RNN (Bi-LSTM) or a combination of both. The CNN effectively handles spatial structure information by extracting local features through convolutional and pooling layers and combining them via fully connected layers. It automatically captures local patterns and global characteristics, benefiting image classification tasks. Bi-LSTM captures temporal dependencies by considering both preceding and succeeding information, which is advantageous for modelling genomic sequence features. Combining the CNN and Bi-LSTM can extract spatial and temporal patterns comprehensively, improving classification performance. The same approach applies to encoding genomic sequence information as feature matrices.
svBreak [96] employs seven CNN models, each serving as a binary classification model, to detect various genomic variations. The network comprises five convolutional layers and three fully connected layers, with the convolutional layers extracting local features from the data. A weighted non-linear mapping is applied to the output matrix by the activation layer using the rectified linear unit (ReLU) function [103], with a 2 × 2 pooling layer for better data processing. Lastly, fully connected layers are used in the output layer to minimize the loss of feature information, ensuring accurate detection of genomic variations.
DECoNT [99] is a comprehensive neural network that predicts precise CNVs and their categories, including deletion, duplication or no call, using a Bi-LSTM structure with 128 hidden neurons in each direction to analyze read depth signals. It includes a batch normalization layer and two fully connected layers, receiving inputs from the Bi-LSTM and previous CNV predictions, represented as one-hot-encoded vectors. The first fully connected layer contains 100 neurons, activated by the ReLU function, while the output layer consists of three neurons using softmax activation for event probability calculation. A weighted cross-entropy loss function optimizes the network for precise CNV prediction, mirroring the process used for categorized CNV prediction.
BreakNet [94] uses CNN and LSTM [104] networks to analyze genomic deletions. The CNN module downsamples the input matrix with an average pooling layer to lessen the computational load and applies six convolution blocks, each with a conv2D layer, an SE optimization layer and a max pooling layer, using the ReLU activation function for non-linearity. The BRNN module has two Bi-LSTM layers with 64 LSTM units each, processing feature vectors from the CNN module and capturing information in both directions. Two fully connected layers categorize the vectors from the BRNN module, with dropout applied after each layer to enhance generalization. The final outputs are computed using the sigmoid function.
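A simplified Keras sketch of this CNN-plus-Bi-LSTM design is shown below; it uses fewer convolution blocks than BreakNet, omits the SE layers and treats the sub-region count, matrix size and layer widths as placeholder assumptions, with the time-distributed application of the CNN across adjacent sub-regions being one plausible reading of the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers

TIME_STEPS, ROWS, COLS = 16, 50, 200      # illustrative: 16 sub-regions per sample

def conv_block(x, filters):
    """One convolution block (the squeeze-and-excitation layer used by BreakNet
    is omitted here for brevity)."""
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.MaxPooling2D(pool_size=2)(x)

# CNN applied to each sub-region matrix to produce one feature vector per step.
cnn_in = tf.keras.Input(shape=(ROWS, COLS, 1))
x = layers.AveragePooling2D(pool_size=2)(cnn_in)        # downsample the input matrix
for filters in (16, 32, 64):
    x = conv_block(x, filters)
cnn = tf.keras.Model(cnn_in, layers.Flatten()(x))

inputs = tf.keras.Input(shape=(TIME_STEPS, ROWS, COLS, 1))
seq = layers.TimeDistributed(cnn)(inputs)
seq = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(seq)
seq = layers.Bidirectional(layers.LSTM(64))(seq)
h = layers.Dense(64, activation="relu")(seq)
h = layers.Dropout(0.3)(h)
out = layers.Dense(1, activation="sigmoid")(h)          # deletion vs. no deletion

model = tf.keras.Model(inputs, out)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```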
MAMnet [95] combines CNN and LSTM models to detect genomic variations, processing the variation feature matrix as a single time step and transforming it into a vector using the CNN. The CNN reuses convolution blocks, each comprising a convolution layer, a max pooling layer, a batch normalization layer and a squeeze-and-excitation (SE) optimization layer [105]. A Bi-LSTM network then captures temporal information across multiple steps in both directions. Finally, three fully connected layers integrate the information to produce the final prediction. The first two fully connected layers are each followed by a dropout layer, while the last fully connected layer uses a sigmoid activation function.
DeepCNV [89] uses a CNN and a DNN to process the image data and metadata, respectively. The CNN branch has a chain of convolutional layers with 3 × 3 receptive fields and fixed 1-pixel-stride filters, using the LeakyReLU [106] activation function and 2 × 2 max pooling. The metadata are modeled through a four-layer DNN. The outputs from the two branches are combined and passed into a 50-neuron fully connected layer, then a terminal sigmoid activation node. This final neuron generates a score indicating false- or true-positive samples. The model is trained with the RMSprop optimizer.
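The two-branch design can be sketched as follows; the image size, filter counts and hidden-layer widths of the metadata DNN are illustrative assumptions, while the LeakyReLU activations, 3 × 3 convolutions, 2 × 2 pooling, 50-neuron merge layer, sigmoid output and RMSprop optimizer follow the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Image branch: stacked 3x3 convolutions with stride 1 and 2x2 max pooling.
img_in = tf.keras.Input(shape=(128, 128, 3), name="cnv_plot")   # illustrative size
x = img_in
for filters in (32, 64, 128):
    x = layers.Conv2D(filters, 3, strides=1, padding="same")(x)
    x = layers.LeakyReLU()(x)
    x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Flatten()(x)

# Metadata branch: a four-layer DNN over the normalized quality-check features.
meta_in = tf.keras.Input(shape=(13,), name="metadata")
m = meta_in
for units in (64, 64, 32, 32):
    m = layers.Dense(units, activation="relu")(m)

# Merge both branches and score the CNV call as true or false positive.
merged = layers.concatenate([x, m])
h = layers.Dense(50, activation="relu")(merged)
out = layers.Dense(1, activation="sigmoid")(h)

model = tf.keras.Model([img_in, meta_in], out)
model.compile(optimizer=tf.keras.optimizers.RMSprop(), loss="binary_crossentropy")
model.summary()
```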
DeepSVFilter [93] applies transfer learning using CNN models pre-trained on the ImageNet dataset for classification [107–112]. Its architecture involves a dropout layer after the pre-trained layers, a fully connected layer and a softmax layer. It employs the Adam optimizer for training on SV images, limits the number of epochs to prevent overfitting and uses a cross-entropy loss function to compare the predicted probabilities with the true category labels. The trained CNN processes candidate SV images and assigns each SV a score ranging from 0 to 1, indicating the likelihood of a true SV.
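A minimal transfer-learning sketch in this spirit is shown below; the MobileNetV2 backbone, image size and hidden-layer width are assumptions (DeepSVFilter offers several ImageNet-pretrained backbones), while the dropout layer, softmax head, Adam optimizer and cross-entropy loss follow the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Backbone choice is an assumption; any ImageNet-pretrained CNN would fit this pattern.
base = tf.keras.applications.MobileNetV2(include_top=False, weights="imagenet",
                                         input_shape=(224, 224, 3))
base.trainable = False                     # keep pre-trained features frozen at first

inputs = tf.keras.Input(shape=(224, 224, 3))
x = base(inputs, training=False)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dropout(0.5)(x)
x = layers.Dense(128, activation="relu")(x)
outputs = layers.Dense(2, activation="softmax")(x)      # true SV vs. false SV

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=10)  # small epoch count limits overfitting
```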
SVision [97] is based on the AlexNet architecture [55] for classifying sequence differences in similarity images, which includes five convolutional layers and three fully connected layers. The first layer processes images with special dimensions. The CNN is trained using transfer learning, initializing parameters with the best parameter set from the ImageNet competition and fine-tuning on the training data. It uses the cross-entropy loss function, and the features extracted during training can also be used for traditional machine learning classification.
Cue [98] builds on the fourth-order stacked hourglass [113, 114] CNN, originally used for human pose estimation, and aims to consolidate information at multiple scales. It starts with a convolutional backbone module, followed by four hourglass modules, each comprising a residual module and max-pooling layers for downscaling, as well as upsampling layers and skip connections to restore the output resolution. After each hourglass module, Cue performs intermediate supervision to generate intermediate confidence map predictions and calculate the loss. This enables iterative re-evaluation of estimates and features at every scale.
The types of SVs that can be detected by different methods vary significantly, for instance, DeepSV is specifically designed for detecting long deletions, while svBreak can identify seven common SV breakpoints. Methods for detecting SVs in short read are generally applicable to WGS data, with only a few suitable for WES data, such as DECoNT and CNV-espresso. Additionally, some methods like Cue can independently perform variant calling and variant type classification. Other methods, like DeepSVFilter and CNV-espresso, require integration with precursor variant calling software, such as Manta [44], Delly [45] or Lumpy [46].
CONCLUSION
The field of genomics is experiencing massive data growth in biomedical and computational biology. Analytical methods have shifted from basic statistics to deep learning algorithms in machine learning, making the process more data-driven. This paper reviews 20 deep learning-based methods for genome variation detection. Specifically, we divide these methods into small variation and structural variation detection, focusing on tensor encoding techniques, training sets and NN models. Deep learning methods represent genetic variations as images, transforming variant calling problems into image classification problems. This approach potentially outperforms traditional methods and represents a promising direction for general SV discovery. These methods, while primarily focusing on germline variant detection, could also be used for somatic variant detection with appropriate training sets.
It's worth noting that each method has advantages and limitations, and users can choose the appropriate method based on the type of variation to be detected and the size of the data fragment. Variation types are mainly divided into small variation and SV. Small variation includes SNV and Indel, with most methods able to detect both SNVs and Indels, whereas NanoSNP is primarily used for identifying SNVs. SV includes duplication, deletion, insertion, inversion and other complex structural variations, among which duplication and deletion are collectively called CNV. For instance, DeepSV focuses on detecting deletions, DeepCNV is used for detecting CNVs and SVision is for detecting CSVs. Data fragment size is categorized into short read and long read. Short-read data should be further distinguished between WGS and WES; for example, DeepSVFilter is used for WGS data and DECoNT for WES data, whereas small variation detection methods such as DeepVariant and CNNScoreVariants do not require this distinction. Long-read data need to be distinguished by whether they were produced by the PacBio or ONT platform; for instance, BreakNet is mainly used for PacBio data, Clair is primarily for ONT data and MAMnet is suitable for data from both platforms. When multiple methods are available for the same variant type and data type, users can choose according to the order in the tables we provide. Typically, methods listed later in a table are improvements on earlier ones; for example, Clair is a successor of Clairvoyante, and NanoCaller is a further improvement of Clair. Additionally, some methods can be used independently, such as DeepVariant and Cue, while others need to be integrated with other tools, like DeepSVFilter, CNV-espresso and Samplot-ML. We recommend that users choose the method that suits the type of variation and data they need to detect, referring to our tables for guidance. If users have sufficient data, they may consider training models with their own data; if data are insufficient, it is advisable to use well-trained models.
Future research may focus on creating scalable tools that can manage complex variants, adapt to new technologies and integrate various sequencing techniques. Current deep learning methods struggle with accurately detecting both short and long reads at the same time, so further work should aim to support multiple sequencing platforms. Additionally, more features in alignment information should be explored to improve variant calling accuracy. Accurately identifying multiple types of SV remains difficult as no single method can precisely identify all types of genomic variations and sequencing data. Therefore, research should aim to develop deep learning methods that can precisely detect different types of SVs, while taking into account the unique characteristics of each sequencing data type.
KEY POINTS
- This article reviews the latest progress of deep learning methods in genome variant calling, including SNV and Indel calling and SV calling.
- We explore tensor encoding methods, which mainly include image-based and multi-dimensional tensor-based methods.
- We analyze the datasets used for variant calling: small variation detection typically uses gold-standard datasets, while structural variation detection datasets often require some form of processing, including synthetic and pretraining methods.
- We study the neural network models, which mainly include three types: CNN-based models, RNN-based models and CNN-RNN hybrid models.
FUNDING
This work was supported by the National Natural Science Foundation of China (Project 62072140) and the Heilongjiang Provincial Science and Technology Department (No. 2022ZXJ03C01).
Author Biographies
Ren Junjun is a PhD student at the Harbin Institute of Technology, School of Computer Science and Technology, Harbin, China.
Zhang Zhengqian is a PhD student at the Harbin Institute of Technology, School of Computer Science and Technology, Harbin, China.
Wu Ying is a Master’s student at the Harbin Institute of Technology, School of Computer Science and Technology, Harbin, China.
Wang Jialiang is a Master’s student at the Harbin Institute of Technology, School of Computer Science and Technology, Harbin, China.
Liu Yongzhuang is an associate professor at the Harbin Institute of Technology, School of Computer Science and Technology, Harbin, China.