Abstract

Nanopore sequence technology has demonstrated a longer read length and enabled to potentially address the limitations of short-read sequencing including long-range haplotype phasing and accurate variant calling. However, there is still room for improvement in terms of the performance of single nucleotide variant (SNV) identification and computing resource usage for the state-of-the-art approaches. In this work, we introduce miniSNV, a lightweight SNV calling algorithm that simultaneously achieves high performance and yield. miniSNV utilizes known common variants in populations as variation backgrounds and leverages read pileup, read-based phasing, and consensus generation to identify and genotype SNVs for Oxford Nanopore Technologies (ONT) long reads. Benchmarks on real and simulated ONT data under various error profiles demonstrate that miniSNV has superior sensitivity and comparable accuracy on SNV detection and runs faster with outstanding scalability and lower memory than most state-of-the-art variant callers. miniSNV is available from https://github.com/CuiMiao-HIT/miniSNV.

Introduction

Single nucleotide variants (SNVs) are the most abundant form of genomic variation which play a pivotal role in various biological processes and many diseases, and are essential to deciphering the relationship between genetic composition and specific diseases [1–3]. Most existing SNV calling methods are optimized for short reads, which encounter mapping difficulties in repetitive regions and result in undetected SNVs [4–7]. The emergence of long-read sequencing technologies, such as Oxford Nanopore Technologies (ONT) [8] and Pacific Biosciences (PacBio) [9], has facilitated more comprehensive analysis due to the longer sequencing reads [10].

ONT is a nanopore-based and high-throughput sequencing platform that can generate reads greater than 100 kb [11, 12], which can easily span and map to repetitive genomic regions. However, the base accuracy of the nanopore reads is relatively lower than that of PacBio HiFi [13] and short reads mainly due to the higher and systematic base-level error characteristic [14]. The accuracy of ONT reads continually improves with enhancements in nanopores, sequencing chemistry, and basecalling algorithms. The latest released experimental data of ONT generated by R10.4.1 nanopores, kit v14 motor protein, and basecalled using a bespoke Dorado [15] model was announced to yield a median accuracy of Q26.4 [16]. Under these circumstances, numerous efforts are underway to improve the identification of small variants, which generally fall into two main categories: deep learning (DL) and statistical-based.

Over the years, DL methods have dominated the era of long-read-based variant calling [17], which had two major designs involving pileup or full-alignment [18]. Pileup-based methods summarized the read alignments into features before input into a variant-calling network, and full-alignment-based methods kept spatial information in the read alignments. Up to now, there are several DL-based tools widely used for genomic variant calling. DeepVariant [19] tried DL for the first time and PEPPER-Margin-DeepVariant [6] integrated full-alignment-based PEPPER, Margin, and DeepVariant to enhance the accuracy of variant calling. Nanocaller [20] is pileup-based and uses a three-layer convolutional neural network architecture. Clair3 [17] first tried combining pileup and full-alignment. Both PEPPER-Margin-DeepVariant and Clair3 leveraged the haplotype information to improve variant identification [21–24]. It is worth noting that Clair3 is officially recommended and is integrated into the official workflow for genomic analysis [25]. As we all know, there are relatively few statistical-based methods used for long-read variant identification because sequencing errors bring too many candidate variants to the heuristic-based candidate finder, limiting the computational time and accuracy of detection. Longshot [5] is a diploid SNV calling method, utilizing a pair hidden Markov Model to average over the uncertainty in the local read alignments and finally determines phased genotypes by calculating genotype likelihoods and performing haplotype assembly using HapCUT2 [23].

Despite the limitations of statistical-based methods, several advantages cannot be ignored compared to DL methods. Firstly, statistical-based methods generally do not require a large number of labeled data for model training. This is especially important in studies of new species or minority population samples, which lack sufficient prior knowledge and training data. Secondly, statistical-based methods are designed with generalization in mind, allowing them to adapt well to different species and sequencing data. These methods are usually based on widely applicable mathematical principles and assumptions, rather than specific data distribution characteristics. Thirdly, many statistical learning methods are more computationally lightweight and faster compared to DL models, which usually require significant computational resources and time for training.

Herein, we introduce miniSNV, a novel SNV calling method optimized for ONT long reads. miniSNV leverages read pileup, read-based phasing, and consensus generation to discover and genotype SNVs. miniSNV also utilizes the known common variants in populations to swiftly pinpoint high-quality variants. The results demonstrate that miniSNV has higher or competitive F1-scores for SNV calling compared to the state-of-the-art approaches. Moreover, miniSNV is most memory-efficient and offers the fastest processing speeds and scalability. Given these advantages, we believe miniSNV has the potential to become a pivotal tool in upcoming genomic studies.

Materials and methods

Using sorted BAM files provided by commonly used read aligners such as Minimap2 [26] as input, miniSNV applies several strategies including read pileup, read-based phasing, and consensus generation to detect and genotype SNVs. The approach has four major steps shown in Fig. 1.

  1. miniSNV applies read pileup to recognize the candidate loci with divergences between reads and reference for variant calling. The candidate loci are divided into two categories, i.e. high- and low-quality loci, relying on the prebuilt known common variants and the complexity of the signatures (such as density, spatial distance, and so on) around the corresponding loci.

  2. miniSNV assigns the genotypes for high-quality loci by comparing the likelihoods of possible genotypes using a binomial model and uses WhatsHap [27] to phase all the heterozygotes and haplotag variants by the raw reads to generate haplotype-specific phased alignment.

  3. miniSNV inputs the phased alignments and collects two sets of reads having a haplotag of 1 and 2 indicating which haplotype the read inherits from respectively. For an active region centered on a low-quality locus, miniSNV extracts all the overlapped reads and employs multiple sequence alignment (MSA) or local assembly to generate consensus sequences of the involved read parts on different read sets.

  4. miniSNV aligns the generated consensus sequence against the local reference sequence of the candidate region and identifies alternative alleles from the realigned information. Finally, miniSNV applies a haplotype-aware genotyping pipeline for variant calling and filtering on all low-quality loci.

The workflow of miniSNV approach. (A) Using read pileup to recognize the candidate loci and divide the loci into high- and low-quality. (B) Using read-based phasing and consensus generation to detect and genotype SNVs (HP1: Haplotype 1; HP2: Haplotype 2).
Figure 1

The workflow of miniSNV approach. (A) Using read pileup to recognize the candidate loci and divide the loci into high- and low-quality. (B) Using read-based phasing and consensus generation to detect and genotype SNVs (HP1: Haplotype 1; HP2: Haplotype 2).

Known common variants collection

miniSNV can rely on the prior knowledge of known variants from a population as additional information to accelerate the identification of variants with high frequency in a population. miniSNV treats loci inside and outside the known common variants in populations with different strategies (more details refer to the methods applied for high- and low-quality candidate loci). miniSNV takes population VCF format as input (1000 genome project in default) and calculates the allele frequency for each alternative allele. In practice, miniSNV organizes the reference genome by a series of windows with fixed length (default: L = 300 bp) and any two adjacent windows have half-length overlap. Multiallelic loci are split into several biallelic loci and any alternative allele with a frequency larger than 5% is recorded with a unique identity (UID) composed by window ID, position, reference base, and alternative base (abbreviated as wid_pos_ref_alt). The window ID is calculated using the formula: |$pos\_ in\_ genome\times 2/L$|⁠, where |$pos\_ in\_ genome$| is the accumulated position in the reference genome across all chromosomes.

Candidate loci generation and classification

miniSNV first filters out all the read alignments with low mapping quality (default MAPQ <20), supplementary alignments, secondary alignments, and unmapped reads. Then, miniSNV applies read pileup by extracting the detailed information from alignment successively and statistic the cumulative count of different alleles (‘A, C, G, T’) in a genome position. Subsequently, compare the ratio |${R}_{alt}={N}_{alt}/{N}_{total}$| and a user-defined threshold (default 0.2) at a genome position pos, where |${N}_{alt}$|is the count of an alternative allele and |${N}_{total}$| is the total number of valid alignments. The position with the specific alternative allele (identified by the UID) is marked as a candidate locus if |${R}_{alt}$| is larger than the threshold. Each alternative allele is considered separately.

miniSNV searches for the UID of each candidate locus in the known common variants in populations. If the locus is found and |${R}_{\mathrm{alt}}$|>0.3, it is classified as high-quality. If the locus is not found but meets the following two conditions, it also qualifies as high-quality: (i) |${R}_{\mathrm{alt}}$|>0.7; (ii) no other candidate loci within 10 base pairs upstream and downstream. The combination of |${R}_{\mathrm{alt}}$| threshold of 0.3–0.7 was selected according to a series of in-house experiments shown in Supplementary Table S1 available online at http://bib.oxfordjournals.org/. All remaining loci are classified as low-quality. It is worth noting that, long reads often suffer from distinguishing the real number of repeat bases in some low-complexity regions including simple repeat (homopolymer) and tandem repeat due to the systemic random sequencing insertion and deletion errors. Simple statistical methods cannot depict the variant accurately usually, and thus, any candidate locus located in the low-complexity region is classified as low-quality by introducing the corresponding annotations.

Genotyping for high-quality candidate loci

miniSNV employs a biallelic assumption to assign genotypes for the high-quality loci which compares the likelihoods of three possible genotypes computed by a binomial model based on the supporting read evidence and chooses the largest one as the corresponding genotype. The model is defined as:

(1)

where |$\theta$| is the probability that a read is mapped to a given zygosity erroneously (default value, 0.03 for SNV), which is obtained empirically and independently between all observations. |${N}_{ref}$|and |${N}_{alt}$|are the number of supporting reads for reference and alternative alleles, respectively.|$p$| is the prior probability for a heterozygous genotype, and herein the prior distribution is configured as equal for all three genotypes. miniSNV normalizes the likelihoods by computing the log sum of exponentials and produces the phred-scaled variant calling quality score for each high-quality SNV.

Phasing and haplotagging using high-quality variants

Long-read sequencing offers significant benefits in read-based phasing, leveraging additional information beyond genotypes to ascertain allele relationships across multiple variant loci. It identifies heterozygote variants from homologous chromosomes and annotates raw reads with specific variants, enhancing signal accuracy within individual haplotypes and mitigating systematic errors. miniSNV initially selects heterozygote variants with quality scores exceeding 15 from a high-quality variant set. Then miniSNV utilizes WhatsHap (version 2.0) to phase and haplotag the variants, generating a haplotype tag list that correlates alignments with haplotypes. Reads are grouped into three phasing sets based on distinct tags {0, 1, 2}, with sets 1 and 2 being deemed valid if their corresponding read counts exceed a user-defined threshold (default 2). Reads within valid sets are utilized to generate a consensus sequence and identify variants in low-quality loci. Notably, sets with tag 0 are unphased and considered invalid. In cases where all three sets are invalid, all reads contribute to generating two consensus sequences representing potential haplotypes.

Haplotype generation and genotyping for low-quality candidate loci

miniSNV also genotypes the low-quality loci based on the original supporting reads of reference and alternative alleles (termed ‘preGT’). The accuracy of ‘preGT’ is influenced by systematic errors, dense variants, or fewer signal features in local areas. For an active region which centers on a low-quality locus and extends 100 bp to both upstream and downstream, miniSNV further applies local assembly or MSA to generate consensus sequences (termed local haplotypes) from haplotype-specific phased alignment to eliminate false positive. The MSA-based model is used if more than 70% of all reads are spanning reads. Otherwise, the assembly-based model is used.

If the MSA-based model is selected, miniSNV extracts segments from all the spanning reads and overlapped reads with an overlapping ratio over 0.6 in a specific active region, and then employs an external MSA tool to reconstruct at most two candidate haplotypes of the active region. We tried several state-of-the-art MSA tool such as abPOA [28], FMAlign [29], HAlign [30] on quite a number of local genomic regions, and found that abPOA is most suited to the task. FMAalign and HAalign also showed good ability and performance, however, they seem not support to produce multiple consensus sequences at heterozygous sites. Thus, abPOA is employed in the current version of miniSNV.

If the assembly-based model is selected, all the segments from overlapped reads exclude reads with an overlapping ratio of less than 0.4 are extracted, and the local de novo assembly is run using a de Bruijn graph approach similar to that described for Manta [31]. The length of the k-mers is incrementally increased from 31 to 121, in 10 steps, to manage and avoid loops in the de Bruijn graph. Contigs generated in the previous iteration are used as pseudo-reads for the generation of contigs in the next round. At most two consensus sequences are derived from the de Bruijn graph, prioritizing variants that have the highest degree of read support, and score variants. The MSA-based assembly model only generates one consensus sequence for a valid phasing set which implies that the sequencing data has yielded sufficient information to determine the phase of alleles that are located on the same chromosome. If no valid phasing set exists in the previous stage, two consensus sequences from all reads will be generated as local haplotypes to distinguish the origination of variants.

Haplotype generation for an active region is considered unsuccessful if the MSA or assembly procedure is unable to generate at least one non-reference haplotype, which also means that no variants will be called in that active region. After filtering haplotypes with less than three supporting reads, the remaining are aligned to the local reference part using KSW2 [26, 32]. Alternative alleles found within 10 bp of the low-quality candidate locus are annotated as ‘discovered’. miniSNV determines the genotype of the discovered allele depending on the number of times it appears on different haplotypes. The genotype is assigned ‘0/0’, ‘0/1’, ‘1/1’ if the number of appearances is 0, 1, 2, respectively. The genotypes assigned in this stage are termed as ‘hapGT’.

In general, miniSNV determines the final genotypes according to hapGT, but there is one case that needs special explanations. If preGT is ‘0/1’ while hapGT is ‘1/1’, the ‘0/1’ represents that all the reads containing alternative alleles should be located in one haplotype (Paternal or Maternal). But ‘1/1’ shows that reads containing alternative alleles are randomly scattered in two different haplotypes that may be caused by systematic errors and should filter out this variant to avoid false positives. Once determined, the genotypes are assigned based on the supporting reads of reference and alternative alleles obtained by an MSA-based or assembly-based model.

Results

We applied real and simulated sequencing data to evaluate the performance of miniSNV and other two state-of-the-art callers, a statistical-based caller Longshot (version 0.4.5) and a DL-based caller Clair3 (version 1.0.4) that Oxford Nanopore officially recommended to identify SNVs. The real Nanopore sequencing data were generated from well-studied GIAB trio samples (HG002, HG003, and HG004) under different sequencing chemistry and basecallers. Minimap2 (version 2.24-r1122) was used to align the real sequencing data with optimized parameters (−x map-ont). miniSNV and Longshot utilized default setting to identify SNVs, while the latest model trained by ONT (https://github.com/nanoporetech/rerio/tree/master/clair3_models) was adopted for Clair3. Several metrics involving precision, recall, F1-score for individual and Mendelian Consistency Rate (MCR) and the Homozygous Discovery Rate (HDR) for trio are detailed in Supplementary Notes (‘Evaluation Metrics’ section).

Results on GIAB Ashkenazi trio ONT data using kit v14 and basecalled with dorado

We benchmarked miniSNV, Clair3, and Longshot on the GIAB Ashkenazi Trio that were sequenced with two PromethION flow cells using the ligation sequencing kit v14 chemistry and R10.4.1 flowcells around 200 Gbases per sample. Dorado with sup (sup-) and hac (hac-) mode was used to basecall and generate mean 70x sequencing coverage datasets for HG002, HG003 and HG004, respectively. The high-confidence SNV callsets of the trio made by GIAB were employed as the ground truth. The results are shown in Fig. 2.

Benchmarking results on GIAB Ashkenazi trio ONT data.(A) F1-score, (B) precision, recall, and (C) MCR, HDR-M for HG004 and HDR-F for HG003 of different callers on sup-datasets. (D) F1-score, (E) precision, recall, and (F) MCR and HDR of different callers on hac-datasets.
Figure 2

Benchmarking results on GIAB Ashkenazi trio ONT data.(A) F1-score, (B) precision, recall, and (C) MCR, HDR-M for HG004 and HDR-F for HG003 of different callers on sup-datasets. (D) F1-score, (E) precision, recall, and (F) MCR and HDR of different callers on hac-datasets.

miniSNV simultaneously achieved the highest F1-score and recall, all of which were greater than 99.7%, and closer precision on HG002, HG003, and HG004 sup-datasets (Fig. 2A and2B, Supplementary Table S2 available online at http://bib.oxfordjournals.org/). Three SNVs called from sup-datasets that could be only detected by miniSNV were displayed in Supplementary Fig. S1 available online at http://bib.oxfordjournals.org/. All the SNVs were situated in relatively complex genomic regions characterized by numerous closely related variant candidates and inconsistent read alignments around short tandem repeat. For the hac-datasets, miniSNV outperformed Clair3 on HG003 (miniSNV:0.9971 and Clair3:0.9970) hac-dataset due to its higher recall, but inferior to Clair3 on HG002 (miniSNV:0.9971 and Clair3:0.9979) and HG004 (miniSNV:0.9974 and Clair3:0.9980) hac-datasets for F1-score (Fig. 2D and2E, Supplementary Table S3 available online at http://bib.oxfordjournals.org/). The performance of Longshot was significantly lower than miniSNV and Clair3 for different metrics especially the <99% recalls on all datasets, which represented that Longshot would omit a certain number of variants. In addition, recalls on HG003 sup-dataset and hac-dataset were also lower compared with HG002 and HG004. Still, miniSNV had relatively consistent precisions and recalls across different samples and technologies, which demonstrated the stable calling performance of miniSNV.

We also assessed the Mendelian laws of inheritance among trios including MCR and HDR. miniSNV and Longshot exhibited higher MCRs than Clair3 by ~3% in both sup-datasets (Fig. 2C, Supplementary Table S4 available online at http://bib.oxfordjournals.org/) and hac-datasets (Fig. 2F, Supplementary Table S5 available online at http://bib.oxfordjournals.org/). And as for HDR, miniSNV surpassed Clair3 and Longshot by 1.8% and 1.6% in sup-dataset, and by 3.4% and 1.8% in hac-dataset for the paternal sample (HG003). For the maternal sample (HG004), Longshot led slightly ahead of miniSNV, with Clair3 close behind. Compared with Clair3 and Longshot, miniSNV demonstrates remarkable stability and high performance in both MCR and HDR, which represents the quality and reliability of variant calling through family sample analysis, and makes it a robust choice for genomic studies focused on trios.

Results on HG002 ONT data basecalled with guppy, and the comparison

We analyzed additional real sequencing data from GIAB sample HG002, sequenced via simplex nanopore with ligation sequencing kit v12 and R10.4 flowcells (FLO-PRO112), to evaluate variant callers across different sequencing elements like flowcell type, chemistry, and base calling methods. The variant calling outcomes for HG002 ONT data which basecalled with Guppy 5.0.15 using kit v12 (k12-data) and kit v14 (k14-data) are displayed in Fig. 3 and Fig. 4.

Benchmarking results on HG002 with k12- and k14-data. Precision, recall, and F1-score of different callers on (A) k12-data and (B) k14-data. F1-score of different callers under different sequencing coverages ranges from 30× to 70× on (C) k12-data and (D) k14-data.
Figure 3

Benchmarking results on HG002 with k12- and k14-data. Precision, recall, and F1-score of different callers on (A) k12-data and (B) k14-data. F1-score of different callers under different sequencing coverages ranges from 30× to 70× on (C) k12-data and (D) k14-data.

Benchmarking results on HG002 with k12- and k14-data outside high-confidence regions. Comparison of (A) precision and (B) recall outside high-confidence regions between miniSNV and Clair3 on k12-data. Comparison of (C) precision and (D) recall outside high-confidence regions between miniSNV and Clair3 on k14-data.
Figure 4

Benchmarking results on HG002 with k12- and k14-data outside high-confidence regions. Comparison of (A) precision and (B) recall outside high-confidence regions between miniSNV and Clair3 on k12-data. Comparison of (C) precision and (D) recall outside high-confidence regions between miniSNV and Clair3 on k14-data.

miniSNV showed some advantages on k12-data with 0.02%, 0.62%, 0.33% and 0.05%, 1.83%, 0.96% improvements on precision, recall and F1-score than Clair3 and Longshot, respectively (Fig. 3A, Supplementary Table S6 available online at http://bib.oxfordjournals.org/). Only miniSNV obtained the F1-score over 99% on k12-data. By comparing the results between k12-data and k14-data, k14-data enhanced the recalls by 1.02%, 1.64%, and 1.77% for miniSNV, Clair3, and Longshot, while slightly improving the precisions (Fig. 3B, Supplementary Table S2 available online at http://bib.oxfordjournals.org/). The comparison also demonstrated that miniSNV could still achieve satisfying results even with the previous chemistry version with high sequencing errors.

We further randomly down-sampled the datasets to 30×, 40×, 50×, and 60× to assess the ability of the SNV callers on different coverage datasets. The benchmarking results at coverages from 30× to 70× of k12-data and k14-data are shown in Fig. 3C (Supplementary Table S7 available online at http://bib.oxfordjournals.org/) and Fig. 3D (Supplementary Table S8 available online at http://bib.oxfordjournals.org/). On the k12-data, miniSNV achieved the best F1-scores on all the datasets with different coverage, especially for 50× and 60× datasets. For the k14-data, the difference between miniSNV and Clair3 was slight, with miniSNV showing better results than Clair3 on the dataset with a coverage depth exceeding 60×. It was worth noting that, the improvement in data quality contributed to an ~1% increase in the F1-score for all three callers across datasets with different coverage depths.

We compared variant-calling performance in difficult-to-map regions and low-complexity regions of the genome which defined by GIAB v3.3 stratifications, and the results on HG002 with k12-data (Fig. 4A and4B, Supplementary Table S9 available online at http://bib.oxfordjournals.org/) and k14-data (Fig. 4C and4D, Supplementary Table S10 available online at http://bib.oxfordjournals.org/) under these regions are shown. No significant precision differences between miniSNV and Clair3 were noted in regions outside high-confidence areas, such as segmental duplications and low-mappability regions. However, miniSNV excelled within specific low-complexity regions (e.g. ChainSelf, L1H_gt500, SimpleRepeat_triTR_15to50, SimpleRepeat_homopolymer_4to6 and LD_discordant_haplotypes) for both datasets, although it underperformed in SimpleRepeat_quadTR_20to50 and segdups. We generated candidate haplotypes (more details refer to the Materials and methods) in some low-complexity regions and aligned them against the local reference sequence to recall more SNVs. This ensured that recalls were consistently higher for miniSNV across both datasets. However, due to inaccurate consensus sequence construction of repeat regions, more false positives may be identified, resulting in poor precision in some regions.

Impact of incorporating known common variants on the performance of SNVs identification

We further designed this section to assess whether integrating established genetic information enhances the accuracy, speed, and reliability of SNVs identification within various genomic contexts. We applied two HG002 sequencing datasets, i.e. k12- and k14-data, and detected the SNVs without relying on known common variants. The detailed results of this evaluation are documented in Supplementary Table S11 available online at http://bib.oxfordjournals.org/. The analysis indicates that incorporating known common variants from the population into the SNVs calling workflow not only significantly speeds up the detection process but also enhances the accuracy of the outcomes. Specifically, on the k12-data dataset, we observed a precision increase of 0.12% and an F1-score improvement of 0.04%. On the k14-data dataset, all evaluation metrics showed improvements: precision increased by 0.08%, recall by 0.16%, and F1-score by 0.12%. Additionally, the running speed reduced by ~15% on average.

In addition, we explored two SNVs at positions 62 534 004 on chromosome 1 and 20 104 697 on chromosome 10 of the GRCh38 reference genome, which surrounded by divergent indel patterns. These SNVs could be accurately detected when incorporating known common variants but were overlooked without their inclusion. Notably, these SNVs were also missed by both Clair3 and Longshot. The example, derived from the k14-data, is depicted in a snapshot from Integrative Genome Viewer shown in Supplementary Fig. S2 available online at http://bib.oxfordjournals.org/. This finding suggests that leveraging the prior knowledge of common variants within the population can enhance the reliability of identifying certain SNVs that frequently occur in populations but present complex signals due to sequencing errors and inconsistent read mapping. Given these substantial enhancements in performance, we recommend the integration of known common variants from the widespread large-scale population genome projects when conducting variant calling on human genomic data.

Results on various simulated datasets

Due to the limited availability of real sequencing datasets and ground truth variants beyond GIAB standard samples, and their extensive use as labeled datasets in DL model training, we conducted a simulation study to evaluate the performance and generalizability of DL versus statistical-based methods. We selected four species, i.e. Human, Mouse, Drosophila, and Arabidopsis, and generated ONT-like datasets using Pbsim3 (version 3.0.4). Chromosome 2 was used for both Human and Mouse genomes. Sequencing depths were set at 60× for Human and Arabidopsis thaliana, and 30× for Mouse and Drosophila. miniSNV was run without using known common variants. The details for the implementation of simulated datasets referred to the Supplementary Notes.

The evaluation results of the three competing callers across these four simulated datasets are presented in Fig. 5 (Supplementary Table S12 available online at http://bib.oxfordjournals.org/). In general, miniSNV was the best caller on identifying SNVs on simulated datasets, and followed by Longshot and Clair3. In the two 60× sequencing datasets, miniSNV outperformed Longshot by 0.36% (99.4% versus 99.58%) and 0.34% (99.90% versus 99.56%) in F1 scores, and by 0.75% (99.94% versus 99.18%) and 0.74% (99.90% versus 99.16%) in sensitivities, respectively. As for the relative low coverage sequencing datasets, the performance of miniSNV and Longshot was comparable, with both winning and losing. It was clear that miniSNV balanced between precision and recall while Longshot showed relative lower recalls. Importantly, we also found that the results of Clair3 were unexpectedly lower than that of miniSNV and Longshot due to the unacceptable recalls, especially, on the three non-human species. The huge difference in Clair3’s performance between real datasets and simulated datasets, may indicate that DL methods have difficulty in achieving good generalization when dealing with data from different species or with certain characteristic differences. In Supplementary Fig. S3 available online at http://bib.oxfordjournals.org/, two SNVs called from the simulated dataset of Arabidopsis could not be detected by Clair3, despite having clear and ample supporting signatures.

Benchmarking results on simulated datasets for different species.(A) F1-score, (B) precision and recall of different callers with simulated datasets for four different species.
Figure 5

Benchmarking results on simulated datasets for different species.(A) F1-score, (B) precision and recall of different callers with simulated datasets for four different species.

We investigated the impact of GC content, which ranges from 20% to 70% in various bacterial genomes, for its relevance to genome complexity on variant identification by simulating diploid-like datasets. These simulations aimed to assess the performance of different callers as GC content varied. The results are presented in Supplementary Table S13 available online at http://bib.oxfordjournals.org/. miniSNV consistently outperformed Longshot and Clair3 across all datasets. It was obvious that statistical-based methods were robust to genomes of varying GC content, and the DL methods could suffer from the availability of ample and reliable datasets with ground truth to learn the model for a specific genome.

The runtime and memory usage of different callers

We assessed the speed and memory footprints of the SNV callers on the HG002 dataset sequenced by kit v14 chemistry and basecalled by Dorado sup mode. The results are provided in Fig. 6A (Supplementary Table S14 available online at http://bib.oxfordjournals.org/). Generally, using a single CPU thread, miniSNV was the fastest caller followed by Clair3, and 4 times faster than Longshot. When using multiple CPU threads, miniSNV and Clair3 had high scaling performance, i.e. they achieved a quasi-linear speedup with the number of CPU threads and their wall clock time was greatly reduced. Longshot was not included mainly because it does not support multiple-thread computing. When considering the memory footprint, miniSNV (4.76 GB) was smaller than Clair3 (10.84 GB) and Longshot (88.13 GB) by 2 times and 20 times under a single CPU thread. More importantly, with the increase of threads used, the memory usage of miniSNV has no significant changes, which represents that miniSNV is suitable for parallel processing but without memory pressure.

Benchmarking results on simulated datasets for different species. The runtime (bar chart) and memory footprints (line chart) of various callers under (A) different CPU cores and (B) different sequencing coverage datasets.
Figure 6

Benchmarking results on simulated datasets for different species. The runtime (bar chart) and memory footprints (line chart) of various callers under (A) different CPU cores and (B) different sequencing coverage datasets.

We further applied the down-sampled datasets with sequencing coverages ranging from 30× to 70×, to benchmark the performance of miniSNV and Clair3 along with the data size using 16 CPU threads. The results are provided in Fig. 6B (Supplementary Table S14 available online at http://bib.oxfordjournals.org/). We found that the runtime of miniSNV decreased linearly as the amount of data size decreased. But Clair3 maintained a runtime of ~120 min for various datasets. It was worth noting that, for the 30× dataset, miniSNV (60 min) was faster than that of Clair3 (120 min) by 2 times. On the memory footprint, miniSNV held a stable and much lower memory consumption range from 4GB to 5GB with the increase in data size, while Clair3 increased linearly.

Discussion and Conclusion

Deep learning and statistical-based methods have been widely applied to variant calling with long reads sequencing, which can generate high-quality SNV calls and enable investigations in the repetitive and difficult-to-map genomic regions. Read-based phasing has been used to determine the haplotype structure and group the reads that are inherited from a single haplotype. Using the grouped reads, a consensus sequence can be generated that represents the most common sequence of the sample. The comprised two strategies, i.e. read-based phasing and consensus generation, can greatly improve the accuracy of variant identification by reducing false positives from systematic errors and help to discover variants that cannot be directly perceived due to limited signatures and alignment bias, by local realignment and groups of alleles inherited together.

Herein, we propose miniSNV, a novel SNV caller, to achieve accurate, fast, and scalable SNV identification with ONT long reads, mainly depending on the algorithm characteristics of the following four aspects.

  1. miniSNV uses known common variants in population studies as variation backgrounds to quickly identify high-quality variants, instead of using the whole pangenome. This lightweight usage can improve the power of variant identification and significantly reduce the computational complexity such as massive path search in variation graphs.

  2. miniSNV applies read pileup to find candidate loci heuristically and categorize them into high- and low-quality loci relying on the variation backgrounds and signature complexity. High- and low-quality loci are processed using different strategies to ensure accuracy and efficiency.

  3. miniSNV employs read-based phasing using long reads spanning multiple high-quality heterozygote variants, to determine and group the raw reads with specific variants to the corresponding haplotype. The phasing information greatly enhances signatures in a single haplotype which helps to distinguish from systematic errors.

  4. miniSNV adopts MSA and local assembly to generate a haplotype consensus sequence. All the differences between the consensus and reference genome by realignment are assigned together as variants. The consensus-based realignment helps to discover more variants omitted by alignment bias and limited signatures, especially for genomic regions outside high-confidence regions.

Real ONT sequencing datasets were implemented on miniSNV and another two state-of-the-art callers, Clair3 and Longshot, to evaluate the power of SNV identification and resource consumption. We derive three key observations from the benchmark. First, miniSNV demonstrates superior sensitivity and comparable accuracy, while maintaining remarkable stability and high performance in GIAB Ashkenazi Trio datasets, which facilitates family sample analysis and genomic studies focused on trios. Second, the comparison between two GIAB HG002 datasets differed in flowcell, chemistry, and basecaller demonstrates that miniSNV always obtains satisfying results (the only caller with F1-score over 99%) even on the ONT dataset with high sequencing errors (kit v12). Moreover, on both datasets, miniSNV demonstrates a more pronounced advantage in SNV identification in low-complexity and difficult-to-map genomic regions. Third, miniSNV is the fastest caller and achieves a nearly-linear multiple-thread speedup, while maintaining stable and relatively lower memory consumption, which is suited to modern HPC resources and helpful for large-scale genomics studies.

In addition, we conducted simulation studies on diverse species to further scrutinize the performance of three callers. miniSNV and longshot, grounded in statistical algorithms, maintain consistent performance regardless of species or sequencing depth. However, Clair3, which relies on DL, exhibits somewhat diminished stability and generalization, which is mainly reflected in the poor sensitivity. These findings indicate that, to some extent, variant calling methods utilizing DL have stringent demands on training data, and the availability of ample and reliable datasets for a specific species is vital, even though they are not always readily accessible. Consequently, for variant identification in other species using DL methods, fine-tuning may be necessary to optimize performance [33].

In this study, we focus on the detection and genotyping of SNVs alone, as accurate calling of short indels using ONT reads is challenging due to the systemic random sequencing errors, even with more accurate ONT data [34]. We will delve into the principles of ONT sequencing data generation and propose an advanced error model for indels in homopolymers and short tandem repeats to improve the yield of indel identification. In addition, we will apply our method to more sequencing platforms like PacBio and short-read sequencing technologies for widespread use in human genomic studies.

Overall, with its excellent SNV identification and running performance, miniSNV is suited to detect SNVs for ONT data. We believe that miniSNV can provide highly efficient and reliable analysis for cutting-edge genomic studies.

Key Points
  • miniSNV is an accurate and stable tool that is suitable for SNV detection from ONT sequencing data, regardless of the sequencing modes and detected species.

  • miniSNV utilizes read pileup and the prior knowledge of common variants in population studies to group the candidate loci and improve the sensitivity and speed of SNV detection.

  • miniSNV leverages local haplotype consensus generation and consensus-based realignment to achieve accurate SNV identification outside high-confidence regions.

  • miniSNV is a fast and lightweight tool that has outstanding scalability and lower memory footprints, which is suitable for most modern computers and helpful for large-scale genomics studies.

Conflict of interest: None declared.

Funding

This work has been supported by the National Key Research and Development Program of China (Nos: 2021YFF1200105, 2022YFF1202101), the Natural Science Foundation of China (No: 62172125, 62402140, 62472120, 62202125), the China Postdoctoral Science Foundation (No: 2022 M720965), and the Heilongjiang Postdoctoral Foundation (No: LBHZ22174).

Data availability

The real GIAB Ashkenazi Trio ONT sequencing data using kit v14 chemistry can be found at: https://labs.epi2me.io/askenazi-kit14-2022-12/ and real HG002 ONT sequencing data using kit v12 chemistry can be found at: https://labs.epi2me.io/gm24385_q20_2021.10/.

References

1.

Shastry
BS
.
SNP alleles in human disease and evolution
.
J Hum Genet
2002
;
47
:
0561
6
. .

2.

Nielsen
R
,
Paul
JS
,
Albrechtsen
A
. et al.
Genotype and SNP calling from next-generation sequencing data
.
Nat Rev Genet
2011
;
12
:
443
51
. .

3.

Shendure
J
,
Balasubramanian
S
,
Church
GM
. et al.
DNA sequencing at 40: past, present and future
.
Nature
2017
;
550
:
345
53
. .

4.

Bowden
R
,
Davies
RW
,
Heger
A
. et al.
Sequencing of human genomes with nanopore technology
.
Nat Commun
2019
;
10
:
1869
. .

5.

Edge
P
,
Bansal
V
.
Longshot enables accurate variant calling in diploid genomes from single-molecule long read sequencing
.
Nat Commun
2019
;
10
:
4660
. .

6.

Shafin
K
,
Pesout
T
,
Chang
PC
. et al.
Haplotype-aware variant calling with PEPPER-margin-DeepVariant enables high accuracy in nanopore long-reads
.
Nat Methods
2021
;
18
:
1322
32
. .

7.

Liu
Y
,
Jiang
T
,
Gao
Y
. et al.
Psi-caller: a lightweight short read-based variant caller with high speed and accuracy
.
Frontiers in Cell and Developmental Biology
2021
;
9
:731424. .

8.

Jain
M
,
Olsen
HE
,
Paten
B
. et al.
The Oxford Nanopore MinION: delivery of nanopore sequencing to the genomics community
.
Genome Biol
2016
;
17
:
1
11
. .

9.

Roberts
RJ
,
Carneiro
MO
,
Schatz
MC
.
The advantages of SMRT sequencing
.
Genome Biol
2013
;
14
:
1
4
. .

10.

Sedlazeck
FJ
,
Lee
H
,
Darby
CA
. et al.
Piercing the dark matter: bioinformatics of long-range sequencing and mapping
.
Nat Rev Genet
2018
;
19
:
329
46
. .

11.

Jain
M
,
Koren
S
,
Miga
KH
. et al.
Nanopore sequencing and assembly of a human genome with ultra-long reads
.
Nat Biotechnol
2018
;
36
:
338
45
. .

12.

Shafin
K
,
Pesout
T
,
Lorig-Roach
R
. et al.
Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes
.
Nat Biotechnol
2020
;
38
:
1044
53
. .

13.

Wenger
AM
,
Peluso
P
,
Rowell
WJ
. et al.
Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome
.
Nat Biotechnol
2019
;
37
:
1155
62
. .

14.

Rang
FJ
,
Kloosterman
WP
,
de
Ridder
J
.
From squiggle to basepair: computational approaches for improving nanopore sequencing read accuracy
.
Genome Biol
2018
;
19
:
90
. .

15.

O. N. T. Dorado
.
(accessed 13 June, 2023)
.

16.

EPI2ME
.
(accessed 06 Dec, 2023)
.

17.

Zheng
Z
,
Li
S
,
Su
J
. et al.
Symphonizing pileup and full-alignment for deep learning-based long-read variant calling
.
Nature Computational Science
2022
;
2
:
797
803
. .

18.

Luo
R
,
Wong
CL
,
Wong
YS
. et al.
Exploring the limit of using a deep neural network on pileup data for germline variant calling
.
Nature Machine Intelligence
2020
;
2
:
220
7
. .

19.

Poplin
R
,
Chang
PC
,
Alexander
D
. et al.
A universal SNP and small-indel variant caller using deep neural networks
.
Nat Biotechnol
2018
;
36
:
983
7
. .

20.

Ahsan
MU
,
Liu
Q
,
Fang
L
. et al.
NanoCaller for accurate detection of SNPs and indels in difficult-to-map regions from long-read sequencing by haplotype-aware deep neural networks
.
Genome Biol
2021
;
22
:
1
33
. .

21.

Chaisson
MJ
,
Sanders
AD
,
Zhao
X
. et al.
Multi-platform discovery of haplotype-resolved structural variation in human genomes
.
Nat Commun
2019
;
10
:
1784
. .

22.

Chin
C-S
,
Peluso
P
,
Sedlazeck
FJ
. et al.
Phased diploid genome assembly with single-molecule real-time sequencing
.
Nat Methods
2016
;
13
:
1050
4
. .

23.

Edge
P
,
Bafna
V
,
Bansal
V
.
HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies
.
Genome Res
2017
;
27
:
801
12
. .

24.

Huddleston
J
,
Chaisson
MJP
,
Steinberg
KM
. et al.
Discovery and genotyping of structural variation from long-read haploid genome sequence data
.
Genome Res
2017
;
27
:
677
85
. .

26.

Li
H
.
Minimap2: pairwise alignment for nucleotide sequences
.
Bioinformatics
2018
;
34
:
3094
100
. .

27.

Patterson
M
,
Marschall
T
,
Pisanti
N
. et al.
WhatsHap: weighted haplotype assembly for future-generation sequencing reads
.
J Comput Biol
2015
;
22
:
498
509
. .

28.

Gao
Y
,
Liu
Y
,
Ma
Y
. et al.
abPOA: an SIMD-based C library for fast partial order alignment using adaptive band
.
Bioinformatics
2021
;
37
:
2209
11
. .

29.

Zhang
P
,
Liu
H
,
Wei
Y
. et al.
FMAlign2: a novel fast multiple nucleotide sequence alignment method for ultralong datasets
.
Bioinformatics
2024
;
40
:
btae014
.

30.

Tang
F
,
Chao
J
,
Wei
Y
. et al.
HAlign 3: fast multiple alignment of ultra-large numbers of similar DNA/RNA sequences
.
Mol Biol Evol
2022
;
39
:
msac166
. .

31.

Chen
X
,
Schulz-Trieglaff
O
,
Shaw
R
. et al.
Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications
.
Bioinformatics
2016
;
32
:
1220
2
. .

32.

Suzuki
H
,
Kasahara
M
.
Introducing difference recurrence relations for faster semi-global alignment of long sequences
.
BMC bioinformatics
2018
;
19
:
33
47
.

33.

Vrbančič
G
,
Podgorelec
V
.
Transfer learning with adaptive fine-tuning
.
IEEE Access
2020
;
8
:
196197
211
. .

34.

Kolmogorov
M
,
Billingsley
KJ
,
Mastoras
M
. et al.
Scalable Nanopore sequencing of human genomes provides a comprehensive view of haplotype-resolved variation and methylation
.
Nat Methods
2023
;
20
:
1483
92
. .

Author notes

Miao Cui and Yadong Liu contributed equally to this work.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]

Supplementary data