Abstract

Next-generation sequencing (NGS) has revolutionized the field of rare disease diagnostics. Whole exome and whole genome sequencing are now routinely used for diagnostic purposes; however, the overall diagnosis rate remains lower than expected. In this work, we review current approaches used for calling and interpretation of germline genetic variants in the human genome, and discuss the most important challenges that persist in the bioinformatic analysis of NGS data in medical genetics. We describe and attempt to quantitatively assess the remaining problems, such as the quality of the reference genome sequence, reproducible coverage biases, or variant calling accuracy in complex regions of the genome. We also discuss the prospects of switching to the complete human genome assembly or the human pan-genome and important caveats associated with such a switch. We touch on arguably the hardest problem of NGS data analysis for medical genomics, namely, the annotation of genetic variants and their subsequent interpretation. We highlight the most challenging aspects of annotation and prioritization of both coding and non-coding variants. Finally, we demonstrate the persistent prevalence of pathogenic variants in the coding genome, and outline research directions that may enhance the efficiency of NGS-based disease diagnostics.

INTRODUCTION

The emergence of next-generation sequencing (NGS) technology has transformed genetics and genomics. In contrast to earlier methods, nucleotide sequences of millions and billions of DNA fragments can be determined in a single NGS experiment. Rapid development and spread of NGS led to a dramatic reduction of sequencing costs, and made genomic analysis available for both researchers and clinicians. The most commonly used NGS platforms (such as Illumina) are based on short (usually 100–150 bp) sequencing reads (reviewed in [1]), though novel methods based on longer reads are becoming more popular (reviewed in [2]).

In medical genetics, NGS is applied to search for the alterations of the genome sequence (genetic variants) that cause rare inherited diseases [3]. These include both relatively simple variants, such as single-nucleotide variants (SNVs) or short (up to several dozen base pairs) insertions or deletions (indels), and more complex types of variants, such as complex chromosomal rearrangements (structural variants, SVs) or copy number variants (CNVs). Although the general goal of the analysis is relatively straightforward, many caveats remain which hinder efficient discovery of such causal genetic variants. It has been reported that the diagnosis rate from whole-genome sequencing (WGS) in trios is limited to a mean of 42%, and this rate significantly varies between different disease groups (reviewed in [4]). The diagnosis rate may also decrease if approaches other than WGS are used, such as whole-exome sequencing (WES) (i.e. targeted sequencing of all protein-coding sequences of the genome) or sequencing of target gene panels.

Bioinformatic analysis and interpretation is the core step in NGS-based diagnostics of rare disease. The main goal of bioinformatic analysis of NGS data in medical genetics is the comprehensive identification of all genetic variants present in the sample. Interpretation of NGS results, on the other hand, concerns with further filtering and classification of the identified variants with a goal of finding one or several variants causally linked to the patient’s phenotype. The efficiency of molecular diagnostics with NGS methods depends on multiple factors, and the incompleteness of knowledge regarding the pathogenetic mechanisms behind rare disease is arguably the most important of such factors. At the same time, data analysis and interpretation also represent a major challenge, and certain persistent problems and user errors may negatively impact the accuracy of the analysis and the diagnosis rate.

The main data analysis steps that are necessary for causal genetic variant identification are summarized in Figure 1. First, raw sequencing reads in the FASTQ format [5] are subjected to quality control (QC) and preprocessing (though the latter step may be omitted in some workflows, see next). After initial QC, reads are aligned to the reference genome sequence, and the alignment results in SAM or BAM format [6] are subjected to additional processing and refinement steps. The processed alignment data are used for variant calling, resulting in a raw VCF file containing all detected variants. This file may undergo additional filtering (depending on the variant calling software used), and then annotated using a variety of available software and resources. Annotated VCF file may then be used for variant interpretation, leading to selection of a variant (or variants) that are considered pathogenic (or likely pathogenic) and potentially causal for the disease in question. Presence of this variant is typically validated by visual inspection of alignments and Sanger sequencing [7].

Basic workflow of NGS data analysis in rare disease diagnostics. The diagram shows the principal steps of the analysis pipeline.
Figure 1

Basic workflow of NGS data analysis in rare disease diagnostics. The diagram shows the principal steps of the analysis pipeline.

In this review, we will discuss in more detail current bioinformatic approaches and remaining challenges in each of the aforementioned steps. Despite the ongoing introduction of long reads to medical genetics and their undoubtful advantages for SV analysis, we will focus on the analysis of short read data, which are most commonly used in rare disease diagnostics.

BASIC DATA ANALYSIS WORKFLOW FOR GERMLINE VARIANT CALLING

There exist a plethora of methods for performing the main analysis steps; hence, we did not intend to provide a comprehensive list of all software. Instead, next we will briefly outline the main analysis steps and highlight the best-performing methods based on recent benchmarks.

From raw data to analysis-ready alignments

Bioinformatic analysis of data starts with the raw sequencing reads. The reads are usually stored in a FASTQ file [5]. QC of raw sequencing reads is performed to examine the base qualities, nucleotide composition and adapter content of the reads. This step is almost exclusively performed using the FASTQC tool or multiQC [8], and may provide insights into the general quality of the sequencing run and library preparation. However, raw reads QC is not enough to conclude that the sequencing data are of sufficient quality, as many important features of the data cannot be examined without read alignment or variant calling results (see next).

After the initial QC step, additional preprocessing of reads may be performed to eliminate adapter sequences using tools such as FASTP [9] or Trimmomatic [10]. This step, however, may be omitted if the results of raw data QC do not indicate presence of detectable adapter sequences. Furthermore, trimming of adapters has been shown to have minimal effects on variant calling in both bacteria [11] and humans [12] in real-life datasets.

Alignment of reads onto the reference genome is typically performed using the BWA MEM [13, 14] tool. However, other well-performing options exist for read alignment, such as Isaac4 (https://github.com/Illumina/Isaac4) or Novoalign (http://novocraft.com/novoalign/). Our analysis of aligner performance showed that all commonly used tools perform similarly [15], with the exception of Bowtie2 [16], which should not be used for germline variant analysis. A more recent study showed that the Bowtie2 alignments can be used for variant calling, if the algorithm parameters are finely tuned, and the mapping quality scores are adjusted [17]. This finding, however, still argues against the routine usage of Bowtie2. Irrespective of the aligner, alignment data are stored in SAM or BAM format, and the SAMtools [6] is the standard tool to perform such operations as format conversion, sorting and indexing for alignment formats.

Mapped reads usually undergo another set of preprocessing steps. The first and the most common one is the identification of duplicate reads. Duplicate reads emerge for several reasons such as library amplification (PCR duplicates), as well as clustering or signal detection errors (optical duplicates). Duplicates introduce bias into the statistical assessment of candidate genetic variants, and marking of such reads helps to increase the accuracy of subsequent variant analysis. Marking of duplicate reads is commonly performed using the Genome Analysis ToolKit (GATK) [18] functionality; however, alternative solutions with better resource efficiency have been proposed, such as dopplemark (https://github.com/grailbio/doppelmark) or sambamba [19].

After marking duplicates, two additional alignment processing steps may be performed. The first of such additional steps is indel realignment, which is performed to correct alignment errors for reads bearing insertions or deletions near the start or end of the read. By default, such reads have a high probability of being soft-clipped by the alignment software or aligned with multiple mismatches instead of a gap. Indel realignment used to be a part of the standard GATK Best Practices workflow [20]; however, this procedure is now performed as part of the HaplotypeCaller algorithm. At the same time, other tools to solve the local realignment task have been developed, such as ABRA2 [21]. Another alignment processing step is the base quality score recalibration (BQSR), initially proposed by the GATK team [22]. The main goal of BQSR is to correct for the inaccuracies and biases in the base quality score estimates produced by the base calling software, which are known to be confounded by machine cycle and sequence context [22]. However, the effects of both indel realignment and BQSR on variant calling are subtle [23], and these steps can be omitted when using non-GATK variant calling in subsequent pipeline steps.

Calling of short variants

Genetic variants can be broadly separated into two groups: (i) short variants, i.e. single-nucleotide substitutions or insertions/deletions up to 50 bp in length; and (ii) SVs that include longer and more complex rearrangements, such as large insertions, duplications or deletions, as well as inversions and translocations. In principle, both types of variants can be called in two different ways: in single-sample mode or in cohort mode (for a set of samples). Each analysis type, however, requires specifically designed software.

Short variant discovery is relatively simple and straightforward compared with SV analysis. There are two principal approaches to short variant calling. Pileup-based callers utilize a compressed representation of the alignment information for each position, and usually call variants based on the prevalence of non-reference bases in reads. While this idea is utilized by the simplest variant callers such as SAMtools [6] and VarScan [24], such an approach can be extremely powerful when combined with sophisticated deep learning-based algorithms (e.g. DeepVariant [25] and Clair3 [26], with the latter tool primarily designed for long-read sequencing data). The other approach to short variant discovery is haplotype-based calling which includes local assembly of reads into haplotypes, usually followed by re-alignment of reads and genotyping. Such an approach is implemented in GATK HaplotypeCaller [20, 22], which is the most widely used germline variant calling software. Other haplotype-based methods include Freebayes [27], Strelka2 [28] and Octopus [29]. Machine learning (ML) algorithms are also used in haplotype-based callers; however, in this case ML is used for subsequent variant filtering rather than the calling itself. For example, a random forest filtering model is available in Octopus, while GATK includes filtering options based on convolutional neural networks (CNNs) [30].

All of the aforementioned software tools are usually used for single-sample variant calling. In contrast to single-sample mode, a particular benefit of the joint calling approach is that all non-missing genotypes (including reference homozygous) are reported in a final VCF file for all samples in a cohort, which is especially important for statistical comparison of variant frequencies, association studies and family-based studies. Furthermore, joint calling was shown to increase the statistical power of variant calling [7, 22]. To perform cohort genotyping, one needs to obtain candidate variants (usually, stored in a GVCF format) for each sample, and then combine these files and jointly genotype the cohort. Several variant callers support GVCF output and can perform joint calling of several samples (e.g. Octopus, Strelka2); at the same time, only GATK natively supports joint genotyping for large cohorts. A recently proposed software tool called GLnexus [31] enables cohort genotyping for GATK- and DeepVariant-generated GVCFs.

Given that the number of tools for short variant calling is large and growing, regular evaluation of their performance is crucial for the proper choice of instruments. Such benchmarking of variant callers is usually performed using gold standard datasets. Several such datasets are provided by the Genome In A Bottle (GIAB) consortium [32]. These include publicly available WES and WGS data for members of the three trios of different ethnicity [Central European from Utah (CEU) individual (HG001, also known as NA12878), the Ashkenazi trio (HG002, HG003, HG004) and the Chinese trio (HG005, HG006, HG007)]. These samples are constantly resequenced using various sequencing technologies. For example, Baid et al. made an effort to extensively sequence all members of the three GIAB trios using both short and long reads [33]. Another option for benchmarking is to use the synthetic diploid sequence [34].

Recently, we made an effort to systematically benchmark these tools using a set of gold standard WES and WGS datasets from GIAB. Our analysis demonstrated that DeepVariant showed the best performance across all datasets and variant types [12, 15]. More importantly, neither DeepVariant nor any other ML-based caller showed evidence of being overfitted for GIAB data, and all methods performed similarly on gold standard data and unrelated samples of diverse ethnicities. At the same time, we demonstrated that ML-based methods other than DeepVariant may be sensitive to high coverage, in particular, in WES samples. Similarly, significant presence of adapters in reads may negatively affect pileup-based methods such as Clair3 [12].

Another important issue is filtering of called variants, which is commonly advised to reduce the burden of false positive calls. Such filtering, however, may be problematic in a diagnostic framework. For example, variant filtering models in GATK may yield puzzling results in high-coverage WES datasets [15]. Similarly, variant filtering methods for cohort genotyping (available in GATK or GLnexus) can also be quite prone to errors if data quality and/or cohort size are below expectation, and several comparisons reported no performance gain from variant filtering in cohorts (e.g. [25]). As thus, additional quality-based filtering of variants has to be approached with much caution. It may be advised to keep filtered variant calls accessible to the interpreter, so that they may be, if necessary, inspected during reanalysis.

It is also important to note that the benchmarking of variant calling pipelines is useful not only for comparison of different tools, but also for evaluation of the validation of pipeline prior to its use in clinical practice. However, such validation should not be limited to evaluation using gold-standard datasets, and specific guidelines for pipeline validation have been compiled recently (e.g. [35, 36]). Recent European recommendations for WGS suggest including monozygotic twins datasets as well as previously solved cases into optimization procedure to confirm the correct variant detection and annotation [37]. Additional evaluation of the pipeline is also advised when recent tool versions or additional functionality are implemented [37].

Furthermore, QC, including evaluation of the variant call quality (reviewed in [38]), has to be performed at all analysis stages routinely [37]. A particularly important problem in the data QC for diagnostic purposes is the detection of sample contamination and sample swaps. Detection of contamination in human DNA sequencing is more difficult than cross-species contamination, and bioinformatic methods for contamination analysis have to rely on patterns of sequence variation. Most of the methods utilize read alignment data to assess contamination. The most commonly used tool, VerifyBamID, uses population allele frequency information from external sources to estimate the contamination rate in a maximum likelihood framework [39]. Recently, an ancestry-agnostic version of the tool, VerifyBamID2, was released [40]. In addition to alignment-based solutions, several methods for contamination analysis have been developed that work with variant call data in VCF files. These include such tools as Peddy [41] or QC3 [42]. These tools provide useful metrics such as X chromosome variant counts, general heterozygous-to-homozygous variant ratio and estimated cross-sample relatedness. All of these metrics are useful both to identify sample swaps and significant levels of contamination. Recently, a new approach to contamination analysis, Contamination from Homozygous Alternate Reference Reads (CHARR), was proposed [43]. This method is based on the so-called ‘reference allele infiltration’ and may be used to rapidly detect different levels of contamination with comparable or better efficiency than VerifyBamID.

SV and CNV calling

SV calling is markedly more complicated compared with short variant discovery, warranting a separate discussion. The detection of SVs is much more accurate with long read technologies, such as Oxford Nanopore (ONT) and Pacific Biosciences (PacBio). In particular, much higher sensitivity of SV detection is usually reported with long reads, and significantly more SVs can be identified in the same sample with long read sequencing (e.g. [44]). Nevertheless, short read data can also be used for this type of analysis, though with lower efficiency. Currently, there are multiple tools for SV detection from short reads, with DELLY [45], Manta [46] and Lumpy [47] being the most widely used ones (according to Google Scholar citations). Manta is also a part of the GATK-SV, an integrative multi-algorithm pipeline used to construct the largest publicly available SV dataset, gnomAD-SV [48]. New tools are actively being developed, such as Dysgu, which is also suitable for long reads [49], and ClinSV, which claims to achieve clinical-grade calling of SVs from short read data [50].

Different SV calling approaches have recently been reviewed by Mahmoud et al. [51]. Most of the commonly used software tools rely on read alignment results. Despite methodological differences, all of such tools utilize three types of information: paired read discordance, split reads and coverage depth. It is generally recommended to employ tools that integrate all of the aforementioned types of information in conjunction with a local read assembly [52]. GRIDSS [53] and Manta have been shown to be strong candidates, although the former reports candidate variants in the so-called breakend notation, which can pose challenges in terms of subsequent interpretation.

Regardless of their popularity, short-read-based methods appear to have hit a ceiling in terms of precision and recall. Frequently, these methods exhibit limitations in specific genomic regions such as low complexity, highly variable or highly repetitive regions [51]. It is noteworthy that the performance of SV calling tools can vary substantially depending on the size of the variant. As has been shown for deletions, nearly all methods tend to underestimate or overestimate the size of rearrangements [54]. Efforts have been made to address these challenges through ensemble calling, which involves combining the strengths of multiple tools to generate a consensus set of SVs (e.g. Parliament2 [55]); however, some studies suggest that using a single state-of-the-art tool is a more efficient approach [52].

One group of SVs that deserve a separate discussion are alterations in the number of copies of DNA segments termed copy number variations (CNVs). CNVs are mainly composed of deletions and duplications and constitute a significant portion of the human genome, accounting for approximately 4.8–9.5% of its total length (reviewed in [56]). To date, several dozen tools have been developed for the identification of CNV using short-read NGS data. The main limitation of these methods lies in detecting short variations that range in length from 50 to 500 bp [57]. Despite the broad range of algorithms available for CNV detection, the primary source of information utilized by the majority of them is read-depth (RD), which is also referred to as read-count (RC). RD-based methods can be classified into two main categories: reference-based methods, and those that detect CNV by comparing the RC of specific regions with the overall chromosome coverage (single sample-based methods). Most commonly used examples include CNVkit [58] and GATK gCNV [59]. Reference-based CNV detection requires a joint analysis of multiple samples, the presence of a pool of control samples or a corresponding model that has been trained on such a pool. Reference-based methods are generally advised for working with data types other than WGS, and accurate selection of reference samples is crucial to achieve higher accuracy [60].

Selection of an optimal CNV caller is complicated by low reproducibility of tool performance across different datasets, as well as limitations associated with the algorithm (e.g. reference-based versus single sample-based). More importantly, however, there is no gold standard dataset for a comprehensive benchmarking of CNV callers [61]. In recent years, several benchmark studies of various sizes have been conducted [61–65]. Making robust conclusions from these studies, however, has proven to be challenging due to the limited overlap in the methods they examine. Common observations from these studies include a biased distribution of detected CNV lengths among different tools and a generally low precision-recall ratio for both WES and WGS data analysis. Notably, the performance of methods for CNV detection from targeted gene panels was found to be significantly better [64], with DECoN [66] and panelcn.MOPS [67] exhibiting the highest metrics. Similarly to generic SV calling, new methods for CNV detection are frequently introduced (e.g. SawyCNV [68]), and ensemble-based approaches have demonstrated potential performance benefits, though no reliable solution using this approach has been provided yet [69].

Calling mosaic variants from short-read data

Mosaic variants (variants present in a fraction of the cells of an organism that arise during development) are responsible for a substantial fraction of rare disease. Both short variants and SVs can be mosaic, and notable examples of both mosaic SVs and SNVs have been described in the context of congenital rare disorders, such as neurological or developmental (e.g. [70–72]).

While mosaic variants are somatic, calling of such variants is different from somatic mutations in cancer, where a normal tissue sample is usually used to filter candidate variants. In the absence of the control sample, separation of bona fide mosaic variants from inherited germline variants, contamination or technical noise becomes a crucial and complicated task. Specialized variant calling tools have been developed to detect mosaic variants with the help of linked reads technology (e.g. Samovar [73] or LinkedSV [74]). Recently, Wang et al. have proposed a set of best practices for identifying mosaic variants from short-read WGS data [75]. The proposed method includes identification of candidate mosaic variants followed by extensive filtering to exclude germline variants (based on variant allele frequency and frequency in the population), sequencing errors and variants outside of accessible genomic regions (see next). According to these best practices, the initial discovery of candidate variants can be performed using either somatic (MuTect2 [76]) or germline (GATK HaplotypeCaller with the ploidy parameter of 50) callers. Still, all methods tend to reach a sensitivity limit around 65%, despite deep ( 250x) WGS data being used for the analysis [75].

KEY FACTORS AFFECTING THE ACCURACY OF VARIANT DISCOVERY

The reference genome quality

Alignment is the first major step of the NGS data analysis pipeline, and the quality of the reference genome sequence is undoubtedly a major factor affecting the results. Several issues with the reference genome quality have been raised over the recent years. First, it is important to mention that the commonly used reference genome sequence originates from the initial assembly obtained by the Human Genome Project [77, 78]. The assembly has been updated multiple times; however, almost 10 years have already passed since the most recent update (GRCh38 or hg38) was published in 2013. Its predecessor, GRCh37 (or hg19), was introduced in 2009. GRCh38 and hg19 represent two of the most commonly used human reference genome assemblies. Both have several commonly used versions which differ in the number of extra-chromosomal sequences, patches and alternative contigs (ALT, representing alternative sequences of specific chromosomal loci). For hg19, the b37 version provided by the BROAD Institute is frequently used; similarly, a standard version of hg38 has also been suggested by BROAD and is included into the GATK resource bundle (https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle).

Despite the fact that almost 10 years have passed since the introduction of the GRCh38 reference, the majority of laboratories that use NGS in clinical settings still use the previous version [79]. Delay with switching to GRCh38 in medical genetics is motivated by the fact that many commonly used resources with genetic variant information, such as 1000 Genomes [80], ESP6500 [81] or the Exome Aggregation Consortium (ExAC) [82], were initially constructed using the b37 reference. The 1000 Genomes project has presented a hg38-realigned version of the data in 2017 [83]. For ExAC, an hg38-based allele frequency information appeared only after the emergence of the Genome Aggregation Database (gnomAD) [84]. Importantly, for the gnomAD v2.1.1, the hg38-based dataset has been produced by lifting over the b37-based VCF, which produces slightly different results compared with the complete realignment and re-analysis of the raw data. However, native hg38-based frequencies are available in the more recent releases, gnomAD v3 [85] and v4. In general, switching to the hg38 reference genome assembly introduced several notable improvements, in particular, in light of variant calling [86, 87]. However, even this assembly has its problems stemming from the approaches used to generate it (reviewed in [88]).

In 2022, a breakthrough in the human genome assembly was made by Nurk et al. [89] who presented the first fully complete reference assembly of the human reference genome sequence, named Telomere-to-Telomere (T2T). It has already been demonstrated that this assembly improves the results of various analyses, including genetic variation [90]. Unlike previous versions of the human reference genome, T2T contains no gaps or extrachromosomal sequences. Compared with the GRCh38, the majority of the added sequence in T2T corresponds to segmental duplications and centromeres, enabling thorough analysis of these regions [91, 92]. Additionally, the copy number of at least several hundred protein-coding genes has been updated. Given these improvements, T2T can be expected to become a new standard reference in human genomics; however, one can expect that switching to a new assembly may take even more time than the previous b37-hg38 transition.

Besides the general completeness of the assembly, there are at least two major issues with the widely used reference genome builds: (i) presence of minor alleles, including known disease variants, in the sequence [93]; and (ii) poor representation of the alternative sequence variants, especially in the hypervariable regions such as the major histocompatibility (MHC) locus. Both issues have attracted significant attention in recent years.

For the former problem, several solutions have been proposed, including construction of a major allele reference (e.g. [94]) or specialized variant callers (e.g. [95]). These solutions, however, are not widely used due to their incompatibility with the standard variant calling pipelines. We proposed an alternative method to deal with the reference minor allele (RMA) problem—RMAHunter, a tool to identify possible RMA-related false positive and false negative calls in a VCF file [93]. This solution, however, does not alleviate the RMA-induced errors at earlier analysis stages, e.g. during the read alignment.

As any individual genome contains a large number of singletons, i.e. rare variants that are uniquely present in an individual genotype [82, 84], one can expect a substantial number of RMA in any haploid genome assembly. Our earlier analysis identified as many as 2094 954 RMAs in the b37 reference, 12 709 of which were located inside the coding genome sequence. A follow-up analysis of RMA presence in hg38 and T2T shows similar results (Figure 2A), with 11 897 coding RMA sites in hg38 and as many as 12 587 such sites in T2T. Of note, only 3003 coding RMAs were shared between hg38 and T2T. An example of a potentially clinically relevant RMA site shared by all three assemblies is the rs587606297 frameshift variant in the SCARF2 gene associated with the Van den Ende-Gupta syndrome (MIM 600920). The reference allele at this variant site has not been detected in any gnomAD individuals, making the presence of this allele in all three assemblies even more striking. In the whole genome, both hg38 and T2T contained above 2 million RMA sites, and, similarly, only 673 487 of these sites were shared (27.0%). These numbers illustrate the fact that RMA sites will remain in any consensus reference genome sequence; moreover, significant attention has to be paid when transferring the results to the T2T assembly given the substantial differences in nucleotide sequence.

Presence of reference minor alleles (RMA) (A) and low-mappability regions (B) in the major human reference genome assemblies used for human genetic variation analysis—b37, hg38 (with either all or only primary contigs), and the new T2T assembly [89]. On (A), RMA sites in T2T are separated according to their concordance with hg38. On (B), the number of bases with a fraction of non-uniquely mapped reads > 0.4 (left) or 0.95 (right) is shown for the gold-standard GIAB WES datasets (HG001–HG007) used in our recent study [15].
Figure 2

Presence of reference minor alleles (RMA) (A) and low-mappability regions (B) in the major human reference genome assemblies used for human genetic variation analysis—b37, hg38 (with either all or only primary contigs), and the new T2T assembly [89]. On (A), RMA sites in T2T are separated according to their concordance with hg38. On (B), the number of bases with a fraction of non-uniquely mapped reads > 0.4 (left) or 0.95 (right) is shown for the gold-standard GIAB WES datasets (HG001–HG007) used in our recent study [15].

While the RMA problem is inherent and almost unavoidable in any haploid consensus sequence, it may be solved by possible changes in the representation of the reference genome itself. One approach which is more and more commonly discussed is switching from a linear reference sequence to a graph-based genome representation. Such a reference genome graph should include most of the common sequence variants (and, especially, SVs), and is thus commonly referred to as a human pan-genome (reviewed in [96]). Alignment to a pan-genome graph is more computationally intensive compared with standard alignment to a linear reference; however, fast and accurate approaches to solve this task have recently been proposed (e.g. Giraffe [97] or the Seven Bridges GRAF pipeline [98]). An important advantage of the pan-genome-based approach is the ability to include all sequences that are relatively common in human populations but are missing from standard assemblies. Varying estimates of the proportion of such sequences have been made, though the majority of studies agree that each individual carries at least 0.5 Mbp of non-reference sequences [96]. The second benefit of the graph genome-based methods is their much higher accuracy in difficult-to-map MHC regions [99], as well as for the analysis of complex SVs. While complete switching to a pan-genome reference remains a remote perspective, increased application of long reads in medical genetics may also facilitate this process.

Sequencing technologies and coverage biases

Another factor that affects the accuracy of variant discovery from NGS data is sequence coverage, i.e. the number of reads aligned to a particular position in the reference genome. Apart from the mean depth of coverage (the number of reads aligned to an average genomic position), it is also important to consider the breadth of coverage—the proportion of bases that are covered with sufficient amounts of reads to enable accurate variant analysis. Breadth of coverage is determined by the uniformity of coverage [100] and existing coverage biases, which are plentiful and differ between platforms [101].

The issue of coverage bias remains in focus of researchers since the emergence of NGS methods and, in particular, WES, with multiple comparative studies performed over the recent decade (e.g. [102]). Coverage is known to be significantly less uniform in WES compared with WGS, especially if WGS libraries are prepared with PCR-free methods [103]. GC-content is frequently considered as the leading factor that negatively affects the evenness of coverage in WES [102, 104, 105].

Our recent results confirm that coverage unevenness remains a significant issue in WES [101]. However, an in-depth analysis of the determinants of low coverage in WES and WGS shows that short read mappability limitations are the single most important determinant of CDS coverage in the human genome. Low-mappability regions span from 400 000 to over 1000 000 base pairs of the human coding sequence, and the mappability problem is more pronounced for WES compared with PCR-free WGS due to lower insert size in many WES libraries [101]. Exome kit design (i.e. inclusion or exclusion of particular regions from the intended target list) is also an important predictor of low coverage for WES [101].

The newest exome capture kits, such as Agilent SureSelect v8 or xGen Exome Panel, introduce improvements both in terms of design and general coverage uniformity [106, 107]. However, a substantial proportion of CDS bases remain non-targeted by the most recent WES kits (e.g. 1.4 Mbp for Agilent SureSelect v8 and 1.9 Mbp—for xGen Exome). Moreover, these non-targeted regions for both of the aforementioned platforms have very low overlap with low-mappability regions (e.g. only |$\approx $| 11 000 bp overlap for Agilent SureSelect v8), indicating that the design of the WES baits remains a significant and independent factor affecting CDS coverage.

The aforementioned limitations of short read mappability are known to affect many important genes [101, 108]. Some of these, such as the spinal muscular atrophy genes SMN1 and SMN2, are completely represented by low-mappability sequences, while other genes (e.g. neuromuscular disease-associated NEB or TTN) include multiple low-mappability parts. Indeed, the amount of bases with low mappability depends on the reference genome assembly used, with segmental duplications being one of the major sources of mappability issues. Importantly, assemblies also substantially differ in the span of regions completely covered by reads with multiple mapping locations (Figure 2B). When all hg38 contigs are used without the proper ALT-aware alignment, up to 1.5 Mbp of CDS sequences fall into the low-mappability category. This observation is consistent with earlier reports by other authors that emphasized the importance of ALT-aware alignment for comprehensive variant calling [60, 109]. When only primary contigs are considered, T2T consistently shows the minimal number of bases covered by non-unique mappers (median number is 430 000 bp), while b37 remains the worst with a median span of low-mappability regions of 662 000 bp. These figures further support the potential importance of the switching to a complete reference genome assembly in medical genomics. At the same time, it is important to note that no changes to human reference genome assembly can completely eliminate the mappability problem, emphasizing the importance of switching to long read sequencing technologies.

Variant calling performance in challenging regions

State-of-the art variant callers demonstrate very good performance in recent benchmarks, with F1 scores well above 99% [15]. However, such a good result is likely explained by the fact that evaluation of the accuracy of variant calling is typically performed using high-confidence regions (reviewed in [99]). These sets of regions are maintained by GIAB for each of the seven gold standard samples (HG001-HG007), and represent the parts of the genome for which high-confidence ground truth variants are available. High-confidence regions broaden over time, owing to the GIAB continuous efforts to improve the dataset using various sequencing technologies, such as linked and long reads [110–112]. For example, high-confidence regions for the latest version of the GIAB dataset (v. 4.2, [112]) include 6% more bases compared with earlier versions, and the additional sequences correspond to various challenging regions. Still, the common (overlapping between all samples) high-confidence regions span only 2.37 Gbp (75.2% of the human genome sequence), and 30.4 Mbp of the coding sequence (86.6%) [15].

The remaining regions, spanning as much as 4.7 Mbp of CDS, are the most challenging, hard-to-call intervals. These regions include locations of SVs and segmental duplications, some of the low-mappability regions, as well as the major histocompatibility complex (MHC) locus [112]. Within the coding genome, the majority of such intervals reside on chromosome X, which is not included into the high-confidence intervals set for all samples except NA12878 (HG001). The X chromosome accounts for 1.3 out 4.7 Mbp of hard-to-call coding regions. Among the autosomes, the largest amount (0.4 Mbp) of hard-to-call regions are located on chromosome 19. Notably, only half of the low mappability regions are not included into the high-confidence set, which corroborates the description by Wagner et al. [112]. The set of genes corresponding to challenging autosomal regions overlaps CDS of 1116 genes. This list of genes is enriched with immune system and hemostasis genes, as well as signaling pathway components. Notable examples of medically important genes with challenging coding sequence include several members of the collagen gene family, such as COL6A2 (linked to Ulrich myodystrophy (MIM 254090) and Bethlem myopathy (MIM 158810)), COL27A1 (causal gene for the Steel syndrome, MIM 615155) and COL11A2 (associated with deafness and other diseases of the ear).

How well do variant callers perform outside of the high-confidence regions? While the ground truth variants are not known for these regions, it is possible to evaluate the performance on challenging regions that are already included in GIAB v.4.2. Such an analysis performed by Olson et al. showed that most pipelines that were used in the precisionFDA Truth Challenge had a substantial drop in performance in difficult-to-map regions (up to 10% in F1 score), and the performance was only slightly better in the MHC region [113]. Similarly, a significant drop in performance can be seen when comparing the performance of the same pipelines on a broader (GIAB v. 4.2) and smaller (GIAB v. 3.3) set of high-confidence regions [15]. However, none of the aforementioned comparisons can provide insights into the performance of variant callers outside of GIAB v. 4.2 high-confidence regions.

In our recent study, we proposed a truth set-free procedure to estimate the accuracy of variant calling by different tools using a set of concordance-based metrics [15]. For example, we showed that the number of variants uniquely called or uniquely missed by a particular variant caller is proportional to its general performance in a benchmark. Notably, our follow-up analysis shows that the percentage of unique calls and unique non-calls is up to 100 times greater outside of high-confidence regions (Figure 3). Furthermore, the results show that the proportion of uniquely called or uniquely missed variants is similar for challenging autosomal CDS regions and for sex chromosomes. Most callers report from 1 to 10% unique variants in non-high-confidence regions, although best-performing solutions (e.g. DeepVariant) tend to have a lower burden of unique calls and unique non-calls in all cases. These results support the findings of [112, 113], and suggest that variant calls outside of high-confidence regions have to be treated cautiously when analyzing and interpreting clinical NGS data. It is important to note that there are more than 4000 such variants within an individual exome (estimated using the HG001 GIAB dataset), and as many as 17 607 (12.9%) known pathogenic variants fall outside of high-confidence regions. These figures are greater than expected given the total length of these intervals, which is likely influenced by the hypervariable and clinically significant MHC region.

The performance of variant callers outside of the high-confidence benchmarking regions evaluated using the proportion of uniquely called (top) or uniquely non-called (bottom) variants (see [15] for the description of the metrics). The evaluation was performed using the gold-standard GIAB WES and WGS datasets (HG001–HG007). CL–Clair3, DV–DeepVariant, FB–Freebayes, G1 and GH–GATK HaplotypeCaller with 1D CNN or hard filtering, OF and OS–Octopus with either random forest or standard filtering, ST–Strelka2.
Figure 3

The performance of variant callers outside of the high-confidence benchmarking regions evaluated using the proportion of uniquely called (top) or uniquely non-called (bottom) variants (see [15] for the description of the metrics). The evaluation was performed using the gold-standard GIAB WES and WGS datasets (HG001–HG007). CL–Clair3, DV–DeepVariant, FB–Freebayes, G1 and GH–GATK HaplotypeCaller with 1D CNN or hard filtering, OF and OS–Octopus with either random forest or standard filtering, ST–Strelka2.

Moreover, the aforementioned results indicate that variants on chromosomes X and Y are also challenging for variant calling pipelines. This result is important given the large number of medically relevant genes located on chromosomes X and Y (e.g. DMD, SRY). For example, 10 604 known pathogenic variants are located on these chromosomes in ClinVar (v. 2023-03-26) (7.8% of all pathogenic variants, which is more than expected by chance (4.0%)). These numbers indicate that more attention has to be paid to variant calling and genotyping within sex chromosomes.

BIOINFORMATIC METHODS FOR VARIANT ANNOTATION AND INTERPRETATION

Interpretation of the results is arguably the most challenging step in NGS-based molecular diagnostics of rare disease. The main goal of variant interpretation is to identify a single variant (or several variants) which are causal for the patient’s phenotype. This task is complicated by an enormous number of genetic variants identified in each single sample. Typically, three to four million variants are identified in an individual WGS sample, with around 25 000 variants found in CDS for both WGS and WES datasets [101]. The number is reproducibly higher for individuals of African ancestry (more than 30 000 CDS variants identified from a typical WGS experiment). Identification of a single causal mutation in such a large set of variants is not straightforward, and many tools have been proposed to assist with this process. Accuracy of variant interpretation greatly depends on both bioinformatic methods (e.g. computational annotation and prioritization of the identified variants) and clinical genetics factors (e.g. quality of the patient’s phenotype description, expertise of the person responsible for interpretation, or the genetic architecture of a particular disease). In this section, we will focus solely on bioinformatic aspects of efficient variant interpretation; however, it is important to emphasize the critical importance of a correct clinical diagnosis and phenotype description for the successful variant interpretation.

Multiple sets of guidelines and recommendations for interpretation of NGS results in medical genetics have been developed. American College of Medical Genetics and Genomics (ACMG) guidelines [114] are considered the gold standard for germline variant interpretation. These guidelines include a fixed number of criteria which assess the potential pathogenicity of a variant based on different kinds of supporting evidence. A set of rules is then applied to classify a variant into one of the five categories based on these criteria: pathogenic, likely pathogenic, likely benign, benign or variant of uncertain significance (VUS). Web-based platforms such as Franklin (https://franklin.genoox.com/clinical-db/home) have been developed to automatically classify a given variant according to ACMG criteria. Further extensions of the ACMG guidelines (e.g. Sherloc [115]) have been proposed to formalize the classification procedure based on numerical scoring schemes. However, fully automated classification of variants remains challenging.

The basics of variant annotation: transcripts and allele frequencies

Correct interpretation of results relies on accurate and comprehensive annotation of genetic variants with the relevant information available. There are three most commonly used tools for variant annotation—SnpEff [116], ANNOVAR [117, 118] and Ensembl Variant Effect Predictor (VEP) [119]. Google Scholar Citations (accessed 2023/04/17) indicates that ANNOVAR is the most cited tool, with more than 11 000 citations to date.

It has previously been noted that the annotation software may yield different results for the same query variant [120, 121]. One reason for the discordance in annotation is the differences in the genome annotation (sets of transcripts) used during annotation. According to earlier reports, the discordance between annotations may amount to as much as 90% for certain classes of variants, and more than 40% for putative loss-of-function (pLoF) variants [120]. While the discordance rate between the tools is lower than between annotations, the concordance of annotations for indel variants is rather low (less than 90%) [121].

ACMG guidelines for variant interpretation suggest that the variants should be interpreted according to the National Center for Biotechnology Information (NCBI) RefSeq [122] or Locus Reference Genomic (LRG) [123] transcript sequences [114]. At the same time, global resources of genetic variation such as gnomAD or National Heart, Lung, and Blood Institute (NHLBI) TopMed [124] report variants according to Ensembl/GENCODE [125, 126] transcripts. Moreover, a richer set of Ensembl transcripts may be helpful in certain complex cases. For example, Schoch et al. described a pathogenic effect of the deletion in the KMT2C gene causing Kleefstra syndrome type 2 (MIM 617768) [127]. Other deletions in the same gene are present in healthy individuals; however, such deletions only affect proximal exons of the RefSeq transcript (NM_170606.3). Ensembl, however, lists additional shorter transcripts for the same gene (e.g. ENST00000424877.5 and ENST00000360104.7), which are not affected by common deletions but are disrupted by a pathogenic deletion in the patient.

In light of the aforementioned data, it is not surprising that different sets of transcripts are used both in non-clinical and clinical practice [79], indicating a need for a unified transcript annotation for clinical genetics. Finally, a joint effort was made in 2022 by NCBI and EMBL-EBI to construct such a set of manually curated transcripts for clinical genetics called Matched Annotation from NCBI and EMBL-EBI (MANE) [128]. Switching to this transcript set would greatly decrease the burden of genome annotation differences.

Besides the choice of transcripts for annotation, allele frequency in the population is also important for ascertaining the pathogenicity of a variant. Allele frequency information helps to filter out many candidate variants which are actually benign and present at a high frequency in healthy individuals. The allele frequency information is usually obtained from large-scale global sequencing projects which provide publicly available summary data, such as the 1000 Genomes project [80], or gnomAD [84]. While global frequency information is plentiful, local allele frequency information is also desirable. It has previously been shown that using the maximum allele frequency across populations (popmax) helps to limit the set of candidate variants for interpretation [82]. Furthermore, our analysis demonstrated the utility of population-specific variants for identification of low-confidence pLoF variants [129]. Multiple ancestry-specific allele frequency databases have been established worldwide (e.g. [130–132]). A substantial number of variants absent from global population databases can be present in healthy donors from underrepresented populations [131], emphasizing the importance of ancestry-specific information for accurate annotation and prioritization of variants.

Multi-nucleotide polymorphisms

Multi-nucleotide polymorphisms (MNP) are combinations of single-nucleotide substitutions in two or more adjacent positions. If several substitutions that constitute an MNP are located within the same codon, their effect on the coding sequence has to be analyzed jointly. However, a default behavior of variant annotation tools is to consider each of the called genetic variants separately, thus leading to potential misinterpretation of the effects of MNPs. Despite the generally low density of genetic variants in an individual genome, several dozen MNPs are present in any sample as shown by ExAC and gnomAD [82, 133]. The problem of MNPs becomes even more important if RMA are taken into account. As shown in our analysis [93], more than 10 000 coding variants can be misclassified due to the presence of an RMA in the same codon. This number also includes known pathogenic variants, such as the rs104894197 variant in the ALX4 gene for which the expected consequence changes from nonsense (p.Ser270*) to missense (p.Ser270Trp) if a neighboring synonymous RMA site (rs10769028) is accounted for. Specialized tools for annotation of MNP effects have been developed (e.g. COPE [134]), but have not become widely used. Confident identification of MNPs also requires genotype phasing, as two heterozygous SNVs in the same codon might be located on different haplotypes [133].

Challenges in predicting the loss-of-function potential of coding variants

Loss-of-function (LoF) is the main mechanism of rare disease pathogenesis. Hence, identification of protein-coding genetic variants with LoF effects is crucial for diagnostics of rare disease. Three classes of genetic variants are usually considered as pLoF variants: nonsense (stop_gained) variants, frameshift insertions and deletions and canonical splice site variants. These variants are also called protein-truncating variants (PTVs), as most of such variants lead to premature termination of protein synthesis. Despite the anticipated large effects of PTVs on gene function, each individual carries up to 100 such variants [82], indicating that additional filters are required to identify genuine LoF variants among the PTVs. Several rules for filtering out low-confidence LoF (LC LoF) variants have been developed, and variant annotator plugins such as LOFTEE [84] have been implemented to classify pLoF variants according to these rules. For example, LOFTEE considers variants within the 5% of the end of the coding sequence as LC LoF. Similarly, an LC LoF flag is assigned to variants in canonical splice sites for introns less than 15 bp long.

However, predefined sets of rules are insufficient to filter out all false positive LoF variants. Gene expression data are also widely used to determine the actual LoF potential of a variant. A recently proposed proportion of expressed transcripts (pext) score [135] utilizes gene expression information from the Genotype Tissue Expression [136] to identify and exclude variants that only affect isoforms with low or zero expression levels. However, as shown in our recent analysis of population-specific PTVs [129], pext is not sufficient to filter out all LC LoF variants. Other approaches for classifying pLoF variants have also been proposed. One example is a machine-learning-based algorithm called MutPred-LoF [137], which showed high efficiency in predicting pathogenicity of nonsense and frameshift variants. Nevertheless, the predictive power of the proposed methods is still far from ideal.

The problem of predicting the LoF effects is significantly more complex for missense variants. Several types of evidence are usually used to solve this task, including data on evolutionary conservation of protein sequences, 3D protein structure and chemical properties of the reference and mutant amino acids. Dozens of bioinformatic tools for prediction of pathogenicity of missense variants have been proposed. The most commonly used algorithms (according to Google Scholar Citations) are SIFT [138, 139], PolyPhen2 [140] and CADD [141]. The large number of tools and high levels of inconsistency, however, led to the emergence of algorithms that aggregate predictions from different software (e.g. REVEL [142] or [143]). Moreover, databases that store pre-computed predictions have been constructed, such as dbNSFP [144]. New algorithms, however, are still being actively developed (e.g. MutPred2, [145]), and strategies for interpreting pathogenicity prediction scores in the ACMG framework have been proposed [146].

Besides missense and other high-impact variants in the coding sequence, other variants that are usually considered silent (i.e. not affecting the gene product function) can also lead to loss of protein function. Multiple examples of unexpected functional effects of synonymous variants have been reported, and different mechanisms of pathogenic effects of these variants have been described. These include introduction or disruption of splice sites and negative effects on translation rate (e.g. [147]), For example, a synonymous variant NM_001077488.2:c.108C>A [p.Val36Val] in the GNAS gene has been reported to result in a cryptic splice site leading to missplicing of the corresponding transcript and causing pseudohypoparathyroidism [148]. Development of computational predictors for synonymous variants is complicated, though several efforts have been made in this direction (e.g. syn-Vep [149]).

One of the important factors that may affect the performance of predictive algorithms is the specificity in the intragenic localization of pathogenic variants. Multiple examples of clustering of pathogenic variants in specific gene regions have been noted. For example, pathogenic variants that cause Floating-Harbor syndrome are uniquely localized in the 34th exon of the SRCAP gene, as confirmed in recent reports [150, 151]. The issue of non-uniform distribution of pathogenic variants has attracted attention of researchers for many years (e.g. [152]). Recently, Laddach et al. investigated the specific patterns of the distribution of pathogenic missense variants compared with neutral ones [153]. In particular, Laddach et al. found that pathogenic variants are enriched in protein cores and protein interaction sites, in contrast to both common and rare variants from a healthy population. However, further research is required to create a comprehensive map of clinical significance for each site in disease genes.

Annotation and interpretation of noncoding genetic variants

It is often hypothesized that one of the reasons explaining the low diagnosis rate of inherited disease from NGS is the inability to interpret the pathogenicity of variants located outside of the coding genome sequences [4]. Indeed, the noncoding variants are dramatically underrepresented among high-confidence pathogenic variants across all databases [154]. Moreover, our analysis shows that the proportion of noncoding variants among the known pathogenic ones remains constant over the last 5 years (Figure 4), likely reflecting both a limited proportion of noncoding variants among the true disease-causing ones as well as the lack of evidence to classify noncoding variants as pathogenic.

The proportion of coding variants among the pathogenic ones in the ClinVar database remains constant over the last 5 years. The numbers were estimated using the corresponding weekly ClinVar versions and the GENCODE v43 human genome annotation.
Figure 4

The proportion of coding variants among the pathogenic ones in the ClinVar database remains constant over the last 5 years. The numbers were estimated using the corresponding weekly ClinVar versions and the GENCODE v43 human genome annotation.

In contrast to protein-coding variants discussed in the previous section, computational prediction of the functional consequences of noncoding variants is even more challenging. However, multiple methods have recently emerged to make such predictions. These methods are based on two principal types of evidence: (i) the level of evolutionary conservation and sequence context of the corresponding genomic region; and (ii) effect of variant on known features of the noncoding genome.

In the first category, generic models for the analysis of evolutionary conservation of genomic regions, such as GERP++ [155] or phastCons [156], can be leveraged for prioritization of noncoding variants. Furthermore, specialized predictors of deleterious effects of noncoding variants based on sequence context and evolutionary conservation status have been proposed (e.g. FATHMM-MKL [157]). Among the functional element-specific methods, much of the attention is focused on evaluating the functional effects of intronic variants, variants affecting known regulatory elements, variants in the 5’- or 3’-untranslated regions (UTRs), and variants affecting annotated noncoding RNA sequences. For intronic variants, SpliceAI has become a widely used standard tool for predicting the effects of variants on splicing [158]. For UTR variants, tools to predict the impact of such variants on the function of upstream open reading frames (uORFs) have been developed [159, 160].

Despite the fact that interpretation of the noncoding variants remains a significant problem for over a decade, the first ACMG-based guidelines for noncoding variant interpretation were published only in 2022 [154]. Notably, these guidelines suggest that only known elements of the noncoding genome (UTRs, introns, promoters and other cis-regulatory elements) functionally linked to the target gene have to be considered during interpretation to decrease the amount of candidate variants. Moreover, evidence supporting the role of a particular element of the noncoding genome should be present in relevant tissues and cell types. These rules predicate a need for the development of new context-aware tools for annotation of noncoding variants. Such tools should consider the functional relevance of the genomic element of interest as well as the impact of a particular variant on this element.

Periodic reanalysis of data for undiagnosed cases

It is also important to emphasize the necessity of periodic re-analysis of the data for undiagnosed cases, as suggested by multiple studies (e.g. [161, 162]). The increase in diagnostic rates from reanalysis varies but is usually substantial. A literature review by Tan et al. showed a median rate of new diagnoses of 15% [163]. Some studies also report improved clinical outcomes for patients who received a diagnosis upon reanalysis [164]. Given the pace of new technological advancements, reanalysis of undiagnosed cases can become even more effective. However, we hope that new sequencing technologies and bioinformatic approaches will help to significantly decrease the number of undiagnosed cases after genomic diagnostics.

CONCLUSION

As demonstrated in this review, NGS data analysis and interpretation remains a daunting task, and various factors negatively affect the overall diagnosis rate for NGS-based rare disease diagnostics. Among these, variant calling in challenging genome regions, defined by limitations of short read sequencing, appears as the most influential factor, affecting up to 15% of all coding sequences in the human genome and more than 12% of all known pathogenic variants. To reduce the impact of this issue on routine data analysis and variant interpretation, one might be advised to use the best-performing combination of software (e.g. BWA MEM aligner and DeepVaraint caller) and be more cautious when interpreting variants in challenging regions. For example, the reliability of the corresponding region should be considered during interpretation (e.g. using a scoring system such as DangerTrack [165]). While other factors, such as errors in the reference genome or annotation inconsistency, seem less influential, these issues also have to be borne in mind upon interpretation to avoid costly mistakes.

As discussed above, a substantial fraction of undiagnosed cases may be due to a lack of biological knowledge of rare disease mechanisms. It seems unlikely that this problem can be solved in the nearest future, though large-scale genomic projects focused on rare disease genetics are of great importance in this context. For example, numerous new findings have been made by the 100 000 Genomes Project by Genomics England (e.g. [166, 167]). Further research into the genetic causes of rare disease may be expected to significantly increase the diagnosis rates for NGS-based diagnostics. In addition to the identification of new disease genes, functional genomic studies into the pathogenetic mechanisms of inherited disorders are also of crucial importance. In particular, investigation of the epistatic interactions between genes, as well as complex gene-environment interactions, may be helpful to predict the functional and clinical effects of genetic variants in individual genomes.

Besides the lack of biological knowledge, diagnosis rates are influenced by technical limitations of the short read-based NGS technologies and linear reference-based analysis methods. The ongoing process of making long read sequencing available for clinical diagnostics can be expected to solve many of the problems mentioned in this review (discussed in more details in [99, 168]). If these developments are combined with the recent breakthroughs in the reference genome assembly, as well as efforts to construct a human pan-genome [169, 170], comprehensive analysis of all genetic variants present in a particular individual may be achieved in the upcoming years.

Key Points
  • Identification of germline genetic variants from NGS data involves several stages: read alignment, alignment preprocessing and variant calling. Each of these steps could be performed with various tools; BWA and DeepVariant appear to be the most accurate aligner and variant caller.

  • Inaccurate variant calling in challenging genomic regions is the most influential factor that negatively impacts variant discovery in as much as 4.7 Mbp of the coding genome sequences. Mappability limitations of short reads and reference minor alleles also affect the results, but to a lesser extent.

  • Prediction of the deleterious effects of genetic variants, both inside and, especially, outside of the coding genome, is one of the most important challenges in bioinformatic variant annotation. Other important issues in this area include the completeness and consistency of the genome annotation and the availability of local allele frequency information.

  • Further research into the molecular mechanisms of inherited disease combined with wider adoption of long read sequencing and a switch to the complete human genome assembly (or the human pangenome reference) will help to overcome the major persistent challenges.

FUNDING

This work was supported by the Ministry of Science and Higher Education of Russian Federation (project Multicenter research bioresource collection ‘Human Reproductive Health’ contract No. 075-15-2021-1058 from 28 September 2021).

AUTHOR CONTRIBUTIONS STATEMENT

All authors participated in the writing of the original draft, as well as its review and editing.

Author Biographies

Yury Barbitoff is the bioinformatics group leader at the Dpt. of Genomic Medicine of the D.O.Ott Research Institute of Obstetrics, Gynaecology, and Reproductology since 2019, and served as a director of research at the Bioinformatics Institute from 2017 to 2022. The main field of research of Dr. Barbitoff’s group is the development of new tools for the analysis of next-generation sequencing data in human genetics. The group has a specific focus on the development and application of bioinformatic methods for rare disease diagnostics.

Mikhail Ushakov is a staff scientist at the Dpt. of Genomic Medicine of the D.O.Ott Research Institute of Obstetrics, Gynaecology, and Reproductology. His work is mainly related to the development of bioinformatic pipelines for variant calling from NGS data.

Tatyana Lazareva works as a staff scientist at the Dpt. of Genomic Medicine of the D.O.Ott Research Institute of Obstetrics, Gynaecology, and Reproductology. Her research focus is the analysis of various factors that determine the impact of genetic variants on the phenotype. She also develops new methods for NGS data analysis and interpretation in medicalgenetics.

Yulia Nasykhova is the head of the Genomics laboratory at the Dpt. of Genomic Medicine of the D.O.Ott Research Institute of Obstetrics, Gynaecology, and Reproductology. Dr. Nasykhova’s laboratory conducts research in the field of human genetics and maintains the “Human Reproductive Health” bioresource collection at the institute.

Andrey Glotov is the Head of the Dpt. of Genomic Medicine of the D.O.Ott Research Institute of Obstetrics, Gynaecology, and Reproductology. Dr. Glotov supervises all research at the Department, which is dedicated to various aspects of humangenetics, with a specific focus on the genetic aspects of human reproduction.

Alexander Predeus is currently a principal bioinformatician at the Wellcome Sanger Institute, UK. His current work is focused on developing novel methods for the analysis of single cell data. During his tenure with Bioinformatics Institute (2015-2017) Alexander served as director of science and worked on human genomics and its applications to rare disease. Alexander is currently a scientific advisor at Bioinformatics Institute.

References

1.

Goodwin
S
,
McPherson
JD
,
Richard McCombie
W
.
Coming of age: ten years of next-generation sequencing technologies
.
Nat Rev Genet
2016
;
17
(
6
):
333
51
.

2.

Logsdon
GA
,
Vollger
MR
,
Eichler
EE
.
Long-read human genome sequencing and its applications
.
Nat Rev Genet
2020
;
21
(
10
):
597
614
.

3.

Biesecker
LG
,
Green
RC
.
Diagnostic clinical genome and exome sequencing
.
New Eng J Med
2014
;
370
(
25
):
2418
25
.

4.

Wright
CF
,
FitzPatrick
DR
,
Firth
HV
.
Paediatric genomics: diagnosing rare disease in children
.
Nat Rev Genet
2018
;
19
(
5
):
253
68
.

5.

Cock
PJA
,
Fields
CJ
,
Goto
N
, et al.
The sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants
.
Nucleic Acids Res
2009
;
38
(
6
):
1767
71
.

6.

Li
H
,
Handsaker
B
,
Wysoker
A
, et al.
The sequence alignment/map format and SAMtools
.
Bioinformatics
2009
;
25
(
16
):
2078
9
.

7.

Koboldt
DC
.
Best practices for variant calling in clinical sequencing
.
Genome Med
2020
;
12
(
1
):
1
13
.

8.

Ewels
P
,
Magnusson
M
,
Lundin
S
,
Käller
M
.
MultiQC: summarize analysis results for multiple tools and samples in a single report
.
Bioinformatics
2016
;
32
(
19
):
3047
8
.

9.

Chen
S
,
Zhou
Y
,
Chen
Y
,
Jia
G
.
Fastp: an ultra-fast all-in-one FASTQ preprocessor
.
Bioinformatics
2018
;
34
(
17
):
i884
90
.

10.

Bolger
AM
,
Lohse
M
,
Usadel
B
.
Trimmomatic: a flexible trimmer for Illumina sequence data
.
Bioinformatics
2014
;
30
(
15
):
2114
20
.

11.

Bush
SJ
.
Read trimming has minimal effect on bacterial SNP-calling accuracy
.
Microb Genom
2020
;
6
(
12
):
1
13
.

12.

Barbitoff
YA
,
Predeus
AV
.
Negligible effects of read trimming on the accuracy of germline short variant calling in the human genome
bioRxiv
,
2023
.

13.

Li
H
,
Durbin
R
.
Fast and accurate short read alignment with burrows-wheeler transform
.
Bioinformatics (Oxford, England)
2009
;
25
(
14
):
1754
60
.

14.

Md.
Vasimuddin
,
Sanchit
Misra
,
Heng
Li
, and
Srinivas
Aluru
.
Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems
. In:
2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
, pages
314
24
.
IEEE
,
2019
.

15.

Barbitoff
YA
,
Abasov
R
,
Tvorogova
VE
, et al.
Systematic benchmark of state-of-the-art variant calling pipelines identifies major factors affecting accuracy of coding sequence variant discovery
.
BMC Genomics
2022
;
23
(
1
):
1
17
.

16.

Langmead
B
,
Salzberg
SL
.
Fast gapped-read alignment with bowtie 2
.
Nat Methods
2012
;
9
(
4
):
357
9
.

17.

Wilton
R
,
Szalay
AS
.
Short-read aligner performance in germline variant identification
.
Bioinformatics
2023
;
39
(
8
):
1
11
.

18.

McKenna
A
,
Hanna
M
,
Banks
E
, et al.
The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data
.
Genome Res
2010
;
10
:
1297
303
.

19.

Tarasov
A
,
Vilella
AJ
,
Cuppen
E
, et al.
Sambamba: fast processing of NGS alignment formats
.
Bioinformatics
2015
;
31
(
12
):
2032
4
.

20.

Van der Auwera
GA
,
Carneiro
MO
,
Hartl
C
, et al.
From FastQ data to high-confidence variant calls: the genome analysis toolkit best practices pipeline
.
Curr Protoc Bioinformatics
(October):10.1–10.33
2013
.

21.

Mose
LE
,
Perou
CM
,
Parker
JS
.
Improved indel detection in DNA and RNA via realignment with ABRA2
.
Bioinformatics
2019
;
35
(
17
):
2966
73
.

22.

DePristo
MA
,
Banks
E
,
Poplin
R
, et al.
A framework for variation discovery and genotyping using next-generation DNA sequencing data
.
Nat Genet
2011
;
43
(
5
):
491
8
.

23.

Li
H
,
Wren
J
.
Toward better understanding of artifacts in variant calling from high-coverage samples
.
Bioinformatics
2014
;
30
(
20
):
2843
51
.

24.

Koboldt
DC
,
Chen
K
,
Wylie
T
, et al.
VarScan: variant detection in massively parallel sequencing of individual and pooled samples
.
Bioinformatics
2009
;
25
(
17
):
2283
5
.

25.

Poplin
R
,
Chang
PC
,
Alexander
D
, et al.
A universal snp and small-indel variant caller using deep neural networks
.
Nat Biotechnol
2018
;
36
:
983
7
.

26.

Luo
R
,
Wong
C-L
,
Wong
Y-S
, et al.
Exploring the limit of using a deep neural network on pileup data for germline variant calling
.
Nat Mach Intell
2020
;
2
(
4
):
220
7
.

27.

Garrison
E
,
Marth
G
.
Haplotype-based variant detection from short-read sequencing
arXiv
.
2012
;
1
9
.

28.

Kim
S
,
Scheffler
K
,
Halpern
AL
, et al.
Strelka2: fast and accurate calling of germline and somatic variants
.
Nat Methods
2018
;
15
(
8
):
591
4
.

29.

Cooke
DP
,
Wedge
DC
,
Lunter
G
.
A unified haplotype-based method for accurate and comprehensive variant calling
.
Nat Biotechnol
2021
;
39
:
885
92
.

30.

Friedman
S
,
Gauthier
L
,
Farjoun
Y
,
Banks
E
.
Lean and deep models for more accurate filtering of SNP and INDEL variant calls
.
Bioinformatics
2020
;
36
(
7
):
2060
7
.

31.

Yun
T
,
Li
H
,
Chang
PC
, et al.
Accurate, scalable cohort variant calls using DeepVariant and GLnexus
.
Bioinformatics
2020
;
36
(
24
):
5582
9
.

32.

Zook
JM
,
Catoe
D
,
McDaniel
J
, et al.
Extensive sequencing of seven human genomes to characterize benchmark reference materials
.
Scientific Data
2016
;
3
:
160025
.

33.

Baid
G
,
Nattestad
M
,
Kolesnikov
A
, et al.
An extensive sequence dataset of gold-standard samples for benchmarking and development
bioRxiv
,
2020
.

34.

Li
H
,
Bloom
JM
,
Farjoun
Y
, et al.
New synthetic-diploid benchmark for accurate variant calling evaluation
.
Nat Methods
2018
;
223297
.

35.

Roy
S
,
Coldren
C
,
Karunamurthy
A
, et al.
Standards and guidelines for validating next-generation sequencing bioinformatics pipelines: a joint recommendation of the Association for Molecular Pathology and the College of American Pathologists
.
J Mol Diagn
2018
;
20
(
1
):
4
27
.

36.

Marshall
CR
,
Chowdhury
S
,
Taft
RJ
, et al.
Best practices for the analytical validation of clinical whole-genome sequencing intended for the diagnosis of germline disease
.
NPJ Genom Med
2020
;
5
(
1
).

37.

Souche
E
,
Beltran
S
,
Brosens
E
, et al.
Recommendations for whole genome sequencing in diagnostics for rare diseases
.
Eur J Hum Genet
2022
;
30
(
9
):
1017
21
.

38.

Guo
Y
,
Ye
F
,
Sheng
Q
, et al.
Three-stage quality control strategies for DNA re-sequencing data
.
Brief Bioinform
2013
;
15
(
6
):
879
89
.

39.

Jun
G
,
Flickinger
M
,
Hetrick
KÂN
, et al.
Detecting and estimating contamination of human DNA samples in sequencing and Array-based genotype data
.
Am J Hum Genet
2012
;
91
(
5
):
839
48
.

40.

Zhang
F
,
Flickinger
M
,
Gagliano
SA
, et al.
Ancestry-agnostic estimation of DNA sample contamination from sequence reads
.
Genome Res
2020
;
30
(
2
):
185
94
.

41.

Pedersen
BS
,
Quinlan
AR
.
Who’s who? Detecting and resolving sample anomalies in human DNA sequencing studies with Peddy
.
Am J Hum Genet
2017
;
100
(
3
):
406
13
.

42.

Guo
Y
,
Zhao
S
,
Sheng
Q
, et al.
Multi-perspective quality control of Illumina exome sequencing data using QC3
.
Genomics
2014
;
103
(
5–6
):
323
8
.

43.

Wenhan
L
,
Gauthier
LD
,
Poterba
T
, et al.
CHARR efficiently estimates contamination from DNA sequencing data
bioRxiv
.
2023
.

44.

Pauper
M
,
Kucuk
E
,
Wenger
AM
, et al.
Long-read trio sequencing of individuals with unsolved intellectual disability
.
Eur J Hum Genet
2021
;
29
(
4
):
637
48
.

45.

Rausch
T
,
Zichner
T
,
Schlattl
A
, et al.
DELLY: structural variant discovery by integrated paired-end and split-read analysis
.
Bioinformatics
2012
;
28
(
18
):
333
9
.

46.

Chen
X
,
Schulz-Trieglaff
O
,
Shaw
R
, et al.
Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications
.
Bioinformatics
2016
;
32
(
8
):
1220
2
.

47.

Layer
RM
,
Chiang
C
,
Quinlan
AR
,
Hall
IM
.
LUMPY: a probabilistic framework for structural variant discovery
.
Genome Biol
2014
;
15
(
6
):
1
19
.

48.

Collins
RL
,
Brand
H
,
Karczewski
KJ
, et al.
A structural variation reference for medical and population genetics
.
Nature
2020
;
581
(
7809
):
444
51
.

49.

Cleal
K
,
Baird
DM
.
Dysgu: efficient structural variant calling using short or long reads
.
Nucleic Acids Res
2022
;
50
(
9
):
E53
.

50.

Minoche
AE
,
Lundie
B
,
Peters
GB
, et al.
ClinSV: clinical grade structural and copy number variant detection from whole genome sequencing data
.
Genome Med
2021
;
13
(
1
):
1
19
.

51.

Mahmoud
M
,
Gobet
N
,
Cruz-Dávalos
DI
, et al.
Structural variant calling: the long and the short of it
.
Genome Biol
2019
;
20
(
1
):
1
14
.

52.

Cameron
DL
,
Di Stefano
L
,
Papenfuss
AT
.
Comprehensive evaluation and characterisation of short read general-purpose structural variant calling software
.
Nat Commun
2019
;
10
(
1
):
1
11
.

53.

Cameron
DL
,
Schröder
J
,
Penington
JS
, et al.
GRIDSS: sensitive and specific genomic rearrangement detection using positional de Bruijn graph assembly
.
Genome Res
2017
;
27
(
12
):
2050
60
.

54.

Sarwal
V
,
Niehus
S
,
Ayyala
R
, et al.
A comprehensive benchmarking of WGS-based deletion structural variant callers
.
Brief Bioinform
2022
;
23
(
4
):
1
12
.

55.

Zarate
S
,
Carroll
A
,
Mahmoud
M
, et al.
Parliament2: accurate structural variant calling at scale
.
GigaScience
2021
;
9
(
12
):
1
9
.

56.

Zarrei
M
,
MacDonald
JR
,
Merico
D
,
Scherer
SW
.
A copy number variation map of the human genome
.
Nat Rev Genet
2015
;
16
(
3
):
172
83
.

57.

Liu
Z
,
Roberts
R
,
Mercer
TR
, et al.
Towards accurate and reliable resolution of structural variants for clinical diagnosis
.
Genome Biol
2022
;
23
(
1
):
1
25
.

58.

Talevich
EA
,
Shain
H
,
Botton
T
,
Bastian
BC
.
CNVkit: genome-wide copy number detection and visualization from targeted DNA sequencing
.
PLoS Comput Biol
2016
;
12
(
4
):
1
18
.

59.

Babadi
M
,
Fu
JM
,
Lee
SK
, et al.
GATK-gCNV enables the discovery of rare copy number variants from exome sequencing data
.
Nat Genet
55
(
9
):
1589
97
.

60.

Corominas
J
,
Smeekens
SP
,
Nelen
MR
, et al.
Clinical exome sequencing–mistakes and caveats
.
Hum Mutat
2022
;
43
(
8
):
1041
55
.

61.

Gordeeva
V
,
Sharova
E
,
Babalyan
K
, et al.
Benchmarking germline CNV calling tools from exome sequencing data
.
Sci Rep
2021
;
11
(
1
):
1
11
.

62.

Yao
R
,
Zhang
C
,
Tingting
Y
, et al.
Evaluation of three read-depth based CNV detection tools using whole-exome sequencing data
.
Mol Cytogenet
2017
;
10
(
1
):
1
7
.

63.

Zhao
L
,
Liu
H
,
Yuan
X
, et al.
Comparative study of whole exome sequencing-based copy number variation detection tools
.
BMC Bioinformatics
2020
;
21
(
1
):
1
10
.

64.

Moreno-Cabrera
JM
,
del Valle
J
,
Castellanos
E
, et al.
Evaluation of CNV detection tools for NGS panel data in genetic diagnostics
.
Eur J Hum Genet
2020
;
28
(
12
):
1645
55
.

65.

Gabrielaite
M
,
Torp
MH
,
Rasmussen
MS
, et al.
A comparison of tools for copy-number variation detection in germline whole exome and whole genome sequencing data
.
Cancer
2021
;
13
(
24
):
1
21
.

66.

Fowler
A
,
Mahamdallie
S
,
Ruark
E
, et al.
Accurate clinical detection of exon copy number variants in a targeted NGS panel using DECoN
.
Wellcome Open Res
2016
;
1
(
May
):
1
12
.

67.

Povysil
G
,
Tzika
A
,
Vogt
J
, et al.
Panelcn.MOPS: copy-number detection in targeted NGS panel data for clinical diagnostics
.
Hum Mutat
2017
;
38
(
7
):
889
97
.

68.

Laver
TW
,
De Franco
E
,
Johnson
MB
, et al.
SavvyCNV: genome-wide CNV calling from off-target reads
.
PLoS Comput Biol
2022
;
18
(
3
):
1
16
.

69.

Coutelier
M
,
Holtgrewe
M
,
Jäger
M
, et al.
Combining callers improves the detection of copy number variants from whole-genome sequencing
.
Eur J Hum Genet
2022
;
30
(
2
):
178
86
.

70.

Shirley
MD
,
Tang
H
,
Gallione
CJ
, et al.
Sturge–weber syndrome and port-wine stains caused by somatic mutation inGNAQ
.
New Engl J Med
2013
;
368
(
21
):
1971
9
.

71.

King
DA
,
Jones
WD
,
Crow
YJ
, et al.
Mosaic structural variation in children with developmental disorders
.
Hum Mol Genet
2015
;
24
(
10
):
2733
45
.

72.

Qin
L
,
Wang
J
,
Tian
X
, et al.
Detection and quantification of mosaic mutations in disease genes by next-generation sequencing
.
J Mol Diagn
2016
;
18
(
3
):
446
53
.

73.

Darby
CA
,
Fitch
JR
,
Brennan
PJ
, et al.
Samovar: single-sample mosaic single-nucleotide variant calling with linked reads
.
iScience
2019
;
18
:
1
10
.

74.

Fang
L
,
Kao
C
,
Gonzalez
MV
, et al.
LinkedSV for detection of mosaic structural variants from linked-read exome and genome sequencing data.
Nat Commun
2019
;
10
(
1
):5585.

75.

Wang
Y
,
Bae
T
,
Thorpe
J
, et al.
Comprehensive identification of somatic nucleotide variants in human brain tissue
.
Genome Biol
2021
;
22
(
1
):
1
32
.

76.

Benjamin
D
,
Sato
T
,
Cibulskis
K
, et al.
Calling somatic SNVs and Indels with Mutect2.
bioRxiv
.
2019
;
861054
.

77.

Lander
ES
,
Linton
LM
,
Birren
B
, et al.
Initial sequencing and analysis of the human genome
.
Nature
6822
,
2001
;
409
:
860
921
.

78.

Venter
JC
,
Adams
MD
,
Myers
EW
, et al.
The sequence of the human genome
.
Science
2001
;
291
(
5507
):
1304
51
.

79.

Morales
J
,
McMahon
AC
,
Loveland
J
, et al.
The value of primary transcripts to the clinical and non-clinical genomics community: survey results and roadmap for improvements
.
Mol Genet Genomic Med
2021
;
9
(
12
):
1
8
.

80.

Auton
A
,
Abecasis
GR
,
Altshuler
DM
, et al.
A global reference for human genetic variation
.
Nature
2015
;
526
(
7571
):
68
74
.

81.

Fu
W
,
O’Connor
TD
,
Jun
G
, et al.
Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants
.
Nature
2013
;
493
(
7431
):
216
20
.

82.

Lek
M
,
Karczewski
KJ
,
Samocha
KE
, et al.
Analysis of protein-coding genetic variation in 60,706 humans
.
Nature
2016
;
536
(
7616
):
285
91
.

83.

Zheng-Bradley
X
,
Streeter
I
,
Fairley
S
, et al.
Alignment of 1000 genomes project reads to reference assembly GRCh38
.
GigaScience
2017
;
6
(
7
):
1
8
.

84.

Karczewski
KJ
,
Francioli
LC
,
Tiao
G
, et al.
The mutational constraint spectrum quantified from variation in 141,456 humans
.
Nature
2020
;
581
(
7809
):
434
43
.

85.

Chen
S
,
Francioli
LC
,
Goodrich
JK
, et al.
A genome-wide mutational constraint map quantified from variation in 76,156 human genomes
bioRxiv
.
2022
.

86.

Guo
Y
,
Dai
Y
,
Hui
Y
, et al.
Improvements and impacts of GRCh38 human reference on high throughput sequencing data analysis
.
Genomics
2017
;
109
(
2
):
83
90
.

87.

Pan
B
,
Kusko
R
,
Xiao
W
, et al.
Similarities and differences between variants called with human reference genome HG19 or HG38
.
BMC Bioinformatics
2019
;
20
(
Suppl 2
).

88.

Ballouz
S
,
Dobin
A
,
Gillis
JA
.
Is it time to change the reference genome?
Genome Biol
2019
;
20
(
1
):
159
.

89.

Nurk
S
,
Koren
S
,
Rhie
A
, et al.
The complete sequence of a human genome
.
Science
2022
;
376
(
6588
):
44
53
.

90.

Aganezov
S
,
Yan
SM
,
Soto
DC
, et al.
A complete reference genome improves analysis of human genetic variation
.
Science
2022
;
376
(
6588
):
eabl3533
.

91.

Vollger
MR
,
Guitart
X
,
Dishuck
PC
, et al.
Segmental duplications and their variation in a complete human genome
.
Science
2022
;
376
(
6588
):
eabj6965
.

92.

Altemose
N
,
Logsdon
GA
,
Bzikadze
AV
, et al.
Complete genomic and epigenetic maps of human centromeres
.
Science
2022
;
376
(
6588
):
eabl4178
.

93.

Barbitoff
YA
,
Bezdvornykh
IV
,
Polev
DE
, et al.
Catching hidden variation: systematic correction of reference minor allele annotation in clinical variant calling
.
Genet Med
2018
;
20
(
3
):
360
4
.

94.

Shukla
HG
,
Bawa
PS
,
Srinivasan
S
.
hg19KIndel: ethnicity normalized human reference genome
.
BMC Genomics
2019
;
20
(1):459.

95.

Magi
A
,
D’Aurizio
R
,
Palombo
F
, et al.
Characterization and identification of hidden rare variants in the human genome
.
BMC Genomics
2015
;
16
(1):340.

96.

Sherman
RM
,
Salzberg
SL
.
Pan-genomics in the human genome era
.
Nat Rev Genet
2020
;
21
(
4
):
243
54
.

97.

Sirén
J
,
Monlong
J
,
Chang
X
, et al.
Pangenomics enables genotyping of known structural variants in 5202 diverse genomes
.
Science
2021
;
374
(
6574
).

98.

Rakocevic
G
,
Semenyuk
V
,
Lee
WP
, et al.
Fast and accurate genomic analyses using genome graphs
.
Nat Genet
2019
;
51
(
2
):
354
62
.

99.

Olson
ND
,
Wagner
J
,
Dwarshuis
N
, et al.
Variant calling and benchmarking in an era of complete human genome sequences
.
Nat Rev Genet
2023
;
24
:
464
83
.

100.

Mokry
M
,
Feitsma
H
,
Nijman
IJ
, et al.
Accurate SNP and mutation detection by targeted custom microarray-based genomic enrichment of short-fragment sequencing libraries
.
Nucleic Acids Res
2010
;
38
(
10
):e116.

101.

Barbitoff
YA
,
Polev
DE
,
Glotov
AS
, et al.
Systematic dissection of biases in whole-exome and whole-genome sequencing reveals major determinants of coding sequence coverage
.
Sci Rep
2020
;
10
(
1
):
1
13
.

102.

Clark
MJ
,
Chen
R
,
Lam
HYK
, et al.
Performance comparison of exome DNA sequencing technologies
.
Nat Biotechnol
2011
;
29
(
10
):
908
14
.

103.

Meienberg
J
,
Bruggmann
R
,
Oexle
K
,
Matyas
G
.
Clinical sequencing: is WGS the better WES?
Hum Genet
2016
;
135
(
3
):
359
62
.

104.

Chilamakuri
CS
,
Lorenz
S
,
Madoui
M-A
, et al.
Performance comparison of four exome capture systems for deep sequencing
.
BMC Genomics
2014
;
15
:
449
.

105.

Lelieveld
SH
,
Spielmann
M
,
Mundlos
S
, et al.
Comparison of exome and genome sequencing Technologies for the Complete Capture of protein-coding regions
.
Hum Mutat
2015
;
36
(
8
):
815
22
.

106.

Zhou
J
,
Zhang
M
,
Li
X
, et al.
Performance comparison of four types of target enrichment baits for exome DNA sequencing
.
Hereditas
2021
;
158
(
1
):
1
12
.

107.

Belova
V
,
Shmitko
A
,
Pavlova
A
, et al.
Performance comparison of Agilent new SureSelect all exon v8 probes with v7 probes for exome sequencing
.
BMC Genomics
2022
;
23
(
1
):
4
11
.

108.

Ebbert
MTW
,
Jensen
TD
,
Jansen-West
K
, et al.
Systematic analysis of dark and camouflaged genes reveals disease-relevant genes hiding in plain sight
.
Genome Biol
2019
;
20
.

109.

Jia
T
,
Munson
B
,
Allen
HL
, et al.
Thousands of missing variants in the UK biobank are recoverable by genome realignment
.
Ann Hum Genet
2020
;
84
(
3
):
214
20
.

110.

Chin
CS
,
Wagner
J
,
Zeng
Q
, et al.
A diploid assembly-based benchmark for variants in the major histocompatibility complex
.
Nat Commun
2020
;
11
(
1
):
1
9
.

111.

Wagner
J
,
Olson
ND
,
Harris
L
, et al.
Curated variation benchmarks for challenging medically relevant autosomal genes
.
Nat Biotechnol
2022
;
40
(
5
):
672
80
.

112.

Wagner
J
,
Olson
ND
,
Harris
L
, et al.
Benchmarking challenging small variants with linked and long reads
.
Cell Genomics
2022
;
2
(
5
):
100128
.

113.

Olson
ND
,
Wagner
J
,
McDaniel
J
, et al.
PrecisionFDA Truth Challenge V2: calling variants from short and long reads in difficult-to-map regions. Cell
Genomics
2022
;
2
(
5
).

114.

Richards
S
,
Aziz
N
,
Bale
S
, et al.
Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology
.
Genet Med
2015
;
17
(
5
):
405
24
.

115.

Nykamp
K
,
Anderson
M
,
Powers
M
, et al.
Sherloc: a comprehensive refinement of the ACMG–AMP variant classification criteria
.
Genet Med
2017
;
19
(
10
):
1105
17
.

116.

Cingolani
P
,
Platts
A
,
Wang
LL
, et al.
A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3
.
Fly
2012
;
6
(
2
):
80
92
.

117.

Wang
K
,
Li
M
,
Hakonarson
H
.
ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data
.
Nucleic Acids Res
2010
;
38
(
16
):
1
7
.

118.

Yang
H
,
Wang
K
.
Genomic variant annotation and prioritization with ANNOVAR and wANNOVAR
.
Nat Protoc
2015
;
10
(
10
):
1556
66
.

119.

McLaren
W
,
Gil
L
,
Hunt
SE
, et al.
The Ensembl variant effect predictor
.
Genome Biol
2016
;
17
(
1
):
1
14
.

120.

McCarthy
DJ
,
Humburg
P
,
Kanapin
A
, et al.
Choice of transcripts and software has a large effect on variant annotation
.
Genome Med
2014
;
6
(
3
):
26
.

121.

Yen
JL
,
Garcia
S
,
Montana
A
, et al.
A variant by any name: quantifying annotation discordance across tools and clinical databases
.
Genome Med
2017
;
9
(
1
):
1
14
.

122.

Pruitt
KD
,
Tatusova
T
,
Maglott
DR
.
NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins
.
Nucleic Acids Res
2007
;
35
(
SUPPL. 1
):
61
5
.

123.

Dalgleish
R
,
Flicek
P
,
Cunningham
F
, et al.
Locus reference genomic sequences: an improved basis for describing human DNA variants
.
Genome Med
2010
;
2
(
4
):
1
7
.

124.

Taliun
D
,
Harris
DN
,
Kessler
MD
, et al.
Sequencing of 53,831 diverse genomes from the NHLBI TOPMed program
.
Nature
2021
;
590
(
7845
):
290
9
.

125.

Howe
KL
,
Achuthan
P
,
Allen
J
, et al.
Cristina Guijarro-Clarke, Leanne haggerty, Anmol Hemrom. Ensembl 2021
.
Nucleic Acids Res
2021
;
49
(
D1
):
D884
91
.

126.

Frankish
A
,
Diekhans
M
,
Jungreis
I
, et al.
Gencode 2021
.
Nucleic Acids Res
2021
;
49
(
D1
):
D916
23
.

127.

Schoch
K
,
Tan
QKG
,
Stong
N
, et al.
Alternative transcripts in variant interpretation: the potential for missed diagnoses and misdiagnoses
.
Genet Med
2020
;
22
(
7
):
1269
75
.

128.

Morales
J
,
Pujar
S
,
Loveland
JE
, et al.
A joint NCBI and EMBL-EBI transcript set for clinical genomics and research
.
Nature
2022
;
604
(
7905
):
310
5
.

129.

Skitchenko
RK
,
Kornienko
JS
,
Maksiutenko
EM
, et al.
Harnessing population-specific protein truncating variants to improve the annotation of loss-of-function alleles
.
bioRxiv
,
2020
.

130.

Boomsma
DI
,
Wijmenga
C
,
Slagboom
EP
, et al.
The genome of the Netherlands: design, and project goals
.
Eur J Hum Genet
2014
;
22
(
2
):
221
7
.

131.

Barbitoff
YA
,
Khmelkova
DN
,
Pomerantseva
EA
, et al.
Expanding the Russian allele frequency reference via cross-laboratory data integration: insights from 7,452 exome samples.
medRxiv
.
2022
.

132.

Barbitoff
YA
,
Skitchenko
RK
,
Poleshchuk
OI
, et al.
Whole-exome sequencing provides insights into monogenic disease prevalence in Northwest Russia
.
Mol Genet Genomic Med
2019
;
7
(
11
):e964.

133.

Wang
Q
,
Pierce-Hoffman
E
,
Cummings
BB
, et al.
Landscape of multi-nucleotide variants in 125,748 human exomes and 15,708 genomes
.
Nat Commun
2020
;
11
(
1
):
1
13
.

134.

Cheng
SJ
,
Shi
FY
,
Liu
H
, et al.
Accurately annotate compound effects of genetic variants using a context-sensitive framework
.
Nucleic Acids Res
2017
;
45
(
10
):
e82
8
.

135.

Cummings
BB
,
Karczewski
KJ
,
Kosmicki
JA
, et al.
Transcript expression-aware annotation improves rare variant interpretation
.
Nature
2020
;
581
(
7809
):
452
8
.

136.

Lonsdale
J
,
Thomas
J
,
Salvatore
M
, et al.
The genotype-tissue expression (GTEx) project
.
Nat Genet
2013
;
45
(
6
):
580
5
.

137.

Pagel
KA
,
Pejaver
V
,
Lin
GN
, et al.
When loss-of-function is loss of function: assessing mutational signatures and impact of loss-of-function genetic variants
.
Bioinformatics
2017
;
33
(
14
):
i389
98
.

138.

Ng
PC
,
Henikoff
S
.
SIFT: predicting amino acid changes that affect protein function
.
Nucleic Acids Res
2003
;
31
(
13
):
3812
4
.

139.

Vaser
R
,
Adusumalli
S
,
Leng
SN
, et al.
SIFT missense predictions for genomes
.
Nat Protoc
2016
;
11
(
1
):
1
9
.

140.

Adzhubey
IA
,
Schmidt
S
,
Peshkin
L
, et al.
A method and server for predicting damaging missense mutations
.
Nat Methods
2010
;
7
(
4
):
248
9
.

141.

Kircher
M
,
Witten
DM
,
Jain
P
, et al.
A general framework for estimating the relative pathogenicity of human genetic variants
.
Nat Genet
2014
;
46
(
3
):
310
5
.

142.

Ioannidis
NM
,
Rothstein
JH
,
Pejaver
V
, et al.
REVEL: an ensemble method for predicting the pathogenicity of rare missense variants
.
Am J Hum Genet
2016
;
99
(
4
):
877
85
.

143.

Korvigo
I
,
Afanasyev
A
,
Romashchenko
N
,
Skoblov
M
.
Generalising better: applying deep learning to integrate deleteriousness prediction scores for whole-exome SNV studies
.
PLoS ONE
2018
;126532.

144.

Liu
X
,
Li
C
,
Mou
C
, et al.
dbNSFP v4: a comprehensive database of transcript-specific functional predictions and annotations for human nonsynonymous and splice-site SNVs
.
Genome Med
2020
;
12
(
1
):
1
8
.

145.

Pejaver
V
,
Urresti
J
,
Lugo-Martinez
J
, et al.
Inferring the molecular and phenotypic impact of amino acid variants with MutPred2
.
Nat Commun
2020
;
11
(
1
).

146.

Pejaver
V
,
Byrne
AB
,
Feng
BJ
, et al.
Calibration of computational tools for missense variant pathogenicity classification and ClinGen recommendations for PP3/BP4 criteria
.
Am J Hum Genet
2022
;
109
(
12
):
2163
77
.

147.

Jin
P
,
Yan
K
,
Ye
S
, et al.
Case report: a synonymous mutation in NF1 located at the non-canonical splicing site leading to exon 45 skipping
.
Front Genet
2021
;
12
:
10
5
.

148.

Apetrei
A
,
Molin
A
,
Gruchy
N
, et al.
A novel synonymous variant in exon 1 of GNAS gene results in a cryptic splice site and causes pseudohypoparathyroidism type 1A and pseudo-pseudohypoparathyroidism in a French family
.
Bone Reports
2021
;
14
:
101073
.

149.

Zeng
Z
,
Aptekmann
AA
,
Bromberg
Y
.
Decoding the effects of synonymous variants
.
Nucleic Acids Res
2021
;
49
(
22
):
12673
91
.

150.

Zhang
S
,
Chen
S
,
Qin
H
, et al.
Novel genotypes and phenotypes among Chinese patients with Floating-Harbor syndrome
.
Orphanet J Rare Dis
2019
;
14
(
1
):
144
.

151.

Turkunova
ME
,
Barbitoff
YA
,
Serebryakova
EA
, et al.
Molecular genetics and pathogenesis of the floating harbor syndrome: case report of long-term growth hormone treatment and a literature review
.
Front Genet
2022
;
13
.

152.

Miller
MP
,
Parker
JD
,
Rissing
SW
,
Kumar
S
.
Quantifying the intragenic distribution of human disease mutations
.
Ann Hum Genet
2003
;
67
(
6
):
567
79
.

153.

Laddach
A
,
Ng
JCF
,
Fraternali
F
.
Pathogenic missense protein variants affect different functional pathways and proteomic features than healthy population variants
. PLoS Biol
2021
;
19
(4):e3001207.

154.

Ellingford
JM
,
Ahn
JW
,
Bagnall
RD
, et al.
Recommendations for clinical interpretation of variants found in non-coding regions of the genome
.
Genome Med
2022
;
14
(
1
):
1
19
.

155.

Davydov
EV
,
Goode
DL
,
Sirota
M
, et al.
Identifying a high fraction of the human genome to be under selective constraint using GERP++
.
PLoS Comput Biol
2010
;
6
(
12
).

156.

Siepel
A
,
Bejerano
G
,
Pedersen
JS
, et al.
Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes
.
Genome Res
2005
;
15
(
8
):
1034
50
.

157.

Shihab
HA
,
Rogers
MF
,
Gough
J
, et al.
An integrative approach to predicting the functional effects of non-coding and coding sequence variation
.
Bioinformatics
2015
;
31
(
10
):
1536
43
.

158.

Jaganathan
K
,
Panagiotopoulou
SK
,
McRae
JF
, et al.
Predicting splicing from primary sequence with deep learning
.
Cell
2019
;
176
(
3
):
535
548.e24
.

159.

Zhang
X
,
Wakeling
M
,
Ware
J
,
Whiffin
N
.
Annotating high-impact 5|$\prime $|-untranslated region variants with the UTRannotator
.
Bioinformatics
2021
;
37
(
8
):
1171
3
.

160.

Filatova
A
,
Reveguk
I
,
Piatkova
M
, et al.
Annotation of uORFs in the OMIM genes allows to reveal pathogenic variants in 5’UTRs
.
Nucleic Acids Res
2023
;
51
(
3
):
1229
44
.

161.

Wenger
AM
,
Guturu
H
,
Bernstein
JA
,
Bejerano
G
.
Systematic reanalysis of clinical exome data yields additional diagnoses: implications for providers
.
Genet Med
2017
;
19
(
2
):
209
14
.

162.

Salfati
EL
,
Spencer
EG
,
Topol
SE
, et al.
Re-analysis of whole-exome sequencing data uncovers novel diagnostic variants and improves molecular diagnostic yields for sudden death and idiopathic diseases
.
Genome Med
2019
;
11
(
1
):
1
8
.

163.

Tan
NB
,
Stapleton
R
,
Stark
Z
, et al.
Evaluating systematic reanalysis of clinical genomic data in rare disease from single center experience and literature review
.
Mol Genet Genomic Med
2020
;
8
(
11
):
1
19
.

164.

Fung
JLF
,
Yu
MHC
,
Huang
S
, et al.
A three-year follow-up study evaluating clinical utility of exome sequencing and diagnostic potential of reanalysis. NPJ
Genomic Medicine
2020
;
5
(
1
).

165.

Dolgalev
I
,
Sedlazeck
F
,
Busby
B
.
DangerTrack: a scoring system to detect difficult-to-assess regions
.
F1000Research
2017
;
6
.

166.

Turro
E
,
Astle
WJ
,
Megy
K
, et al.
Whole-genome sequencing of patients with rare diseases in a national health system
.
Nature
2020
;
583
(
7814
):
96
102
.

167.

Greene
D
,
Genomics England Research Consortium
,
D
, et al.
Genetic association analysis of 77,539 genomes reveals rare disease etiologies
.
Nat Med
2023
;
29
:
679
88
.

168.

Marwaha
S
,
Knowles
JW
,
Ashley
EA
.
A guide for the diagnosis of rare and undiagnosed disease: beyond the exome
.
Genome Med
2022
;
14
(
1
):
1
22
.

169.

Wang
T
,
Antonacci-Fulton
L
,
Howe
K
, et al.
The human Pangenome project: a global resource to map genomic diversity
.
Nature
2022
;
604
(
7906
):
437
46
.

170.

Liao
WW
,
Asri
M
,
Ebler
J
, et al.
A draft human pangenome reference
.
Nature
2023
;
617
(
7960
):
312
24
.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.