-
PDF
- Split View
-
Views
-
Cite
Cite
James M Holt, Christopher T Saunders, William J Rowell, Zev Kronenberg, Aaron M Wenger, Michael Eberle, HiPhase: jointly phasing small, structural, and tandem repeat variants from HiFi sequencing, Bioinformatics, Volume 40, Issue 2, February 2024, btae042, https://doi.org/10.1093/bioinformatics/btae042
- Share Icon Share
Abstract
In diploid organisms, phasing is the problem of assigning the alleles at heterozygous variants to one of two haplotypes. Reads from PacBio HiFi sequencing provide long, accurate observations that can be used as the basis for both calling and phasing variants. HiFi reads also excel at calling larger classes of variation, such as structural or tandem repeat variants. However, current phasing tools typically only phase small variants, leaving larger variants unphased.
We developed HiPhase, a tool that jointly phases SNVs, indels, structural, and tandem repeat variants. The main benefits of HiPhase are (i) dual mode allele assignment for detecting large variants, (ii) a novel application of the A*-algorithm to phasing, and (iii) logic allowing phase blocks to span breaks caused by alignment issues around reference gaps and homozygous deletions. In our assessment, HiPhase produced an average phase block NG50 of 480 kb with 929 switchflip errors and fully phased 93.8% of genes, improving over the current state of the art. Additionally, HiPhase jointly phases SNVs, indels, structural, and tandem repeat variants and includes innate multi-threading, statistics gathering, and concurrent phased alignment output generation.
HiPhase is available as source code and a pre-compiled Linux binary with a user guide at https://github.com/PacificBiosciences/HiPhase.
1 Introduction
For human genomics, phasing variants within a gene is a critical aspect of clinical testing for pharmacogenomics (Caspar et al. 2021), HLA typing (Mayor et al. 2019), and autosomal recessive conditions (Tewhey et al. 2011). Additionally, automated tertiary analysis tools can benefit from knowing whether two variants are in cis or trans within a gene, potentially reducing the time for clinical analysts to reach a molecular diagnosis (Danecek and McCarthy 2017). Beyond clinical diagnostics, there are many additional applications of phasing in genomics, many of which are summarized in a review by Browning and Browning (2011).
Haplotype phasing generally falls into three categories: statistical phasing (Browning and Browning 2011), pedigree-based phasing (Browning and Browning 2011), and read-backed phasing (Snyder et al. 2015). For read-backed phasing, each read is mapped to a reference genome and variants are called from the read mappings. The alleles of heterozygous variants are phased into a single diplotype composed of two haplotypes. Once phased, the variants can be fed into various tertiary analysis tools for downstream analyses. In practice, reads that are long and accurate, such as PacBio HiFi reads (Wenger et al. 2019), include a large number of informative bases (more than one per 1 kb of read length) overlapping heterozygous variants making them ideal for variant calling followed by read-backed phasing.
For long reads, WhatsHap (Patterson et al. 2015) is the most prominent read-backed phaser and the de facto standard in the field for phasing single-nucleotide variants (SNVs) and short indels. WhatsHap uses a “fixed parameter tractable approach,” meaning that it down-samples the sequencing reads to a fixed coverage value, to phase the heterozygous variants (Patterson et al. 2015). Despite being generally efficient and accurate for phasing a human genome, WhatsHap does have some notable limitations: (i) it down-samples the data in order to run efficiently, potentially biasing the data; (ii) multi-allelic variants (sites with two or more non-reference alleles) are not phased; (iii) it does not span gaps in coverage caused by large deletions or reference gaps; (iv) it is single-threaded, so parallel processing requires some engineering overhead; and (v) it was designed primarily for small variants like SNVs and short indels. Recently, some tools have focused on that last limitation by either overlaying structural variants (insertions and deletions 50 bp) on the WhatsHap phase blocks (Mahmoud et al. 2021) or jointly phasing SNVs and structural variants, but not short indels (Lin et al. 2022). To our knowledge, no tool officially supports tandem repeat variants. This creates a notable gap for a tool that jointly phases SNVs, indels, structural, and tandem repeat variants.
Here, we introduce HiPhase, a new tool that jointly phases SNVs, indels, structural, and tandem repeat variants called from PacBio HiFi sequencing on diploid organisms. HiPhase uses two novel approaches to solve the phasing problem: (i) dual mode allele assignment and (ii) a phasing algorithm based on the A* search algorithm (Hart et al. 1968). HiPhase offers additional benefits, such as: (i) no down-sampling of the data, (ii) support for multi-allelic variation, (iii) logic to span coverage gaps with supplementary alignments, (iv) innate multi-threading, (v) built-in statistics gathering, and (vi) support for assigning aligned reads to a haplotype (“haplotagging”) while phasing. In our assessment, HiPhase produced an average phase block NG50 of 480 kb with 929 switchflip errors and fully phased 93.8% of genes.
2 Materials and methods
HiPhase breaks the phasing problem into three major components: phase block generation, allele assignment, and diplotype solving.
2.1 Phase block generation
Phase block generation works by identifying pairs of consecutive heterozygous variant calls connected by at least one read mapping. A phase block is initialized by taking the first available variant on a chromosome and creating a single-variant block from it. Then, the next variant is loaded and HiPhase searches for mappings that span both that new variant and the current block. If no mappings span both, the algorithm will then check for supplementary mappings from the new variant into the current block. This supplementary mapping check allows HiPhase to span coverage gaps caused by homozygous deletions and reference gaps. If at least one spanning or supplementary mapping is identified, then, the current phase block is extended to include the variant. Otherwise, the current block is returned as a putative phase block and a new single-variant block is initialized from the new variant. This process repeats until all heterozygous variants on the chromosome are consumed.
Each putative phase block acts as an isolated sub-problem in the full solution. Because the putative blocks are unconnected by the read mappings at their ends, they represent a lower-bound on the number of phase blocks in the final solution. Importantly, each sub-problem can be solved independently, allowing for allele assignment and diplotype solving to be performed in parallel for each putative phase block. This forms the basis for multi-threading in HiPhase.
2.2 Allele assignment
Allele assignment converts all read mappings within a putative phase block into condensed allelic observations, which are chains of reference or alternate alleles corresponding to the observed alleles within a particular read mapping. This simplifies a long-read sequence down to a smaller set of integer values representing alleles that are present within the read. We refer to these assignments as reference or alternate alleles, but HiPhase also allows for ambiguity, unassigned alleles, and multi-allelic variation (two different alternate alleles at one position). Additionally, each observed allele is assigned a “weight” indicating the cost to alter or ignore that allele in the diplotype solving step.
HiPhase has two modes for allele assignment: local re-alignment and global re-alignment. In brief, local re-alignment assigns alleles and quality based on a small window around each variant position, which is conceptually similar to the allele assignment process of WhatsHap (Patterson et al. 2015). In contrast, global re-alignment will fully re-align the mapping against a local alt-aware reference graph using a graph-aware version of the wavefront algorithm (Marco-Sola et al. 2020). In general, local re-alignment is a faster process, but it is ill-suited for accurate allele assignment for large structural and tandem repeat variants and even some smaller indels. Global re-alignment tends to be slower but is more accurate when it comes to allele assignment, especially around structural and tandem repeat variants. Finally, HiPhase implements a “dual mode” allele assignment where if global re-alignment is too slow, it will fall back on local re-alignment for the phase block.
Once all mappings have been converted to a condensed allele representation, HiPhase collapses mappings with the same read name into a single entry. The primary purpose of this step is to create a bridge between supplementary mappings that span a gap in coverage. This allows HiPhase to cross deletion events and reference gaps bridged by split read mappings. If the mappings for one read overlap but have a conflicting allele assignment, then that allele is converted to an ambiguous allele assignment in the collapsed representation. In the end, each read is represented exactly once in the collection of condensed alleles for the phase block.
2.3 Diplotype solving
Diplotype solving distills the condensed allelic observations into two representative haplotypes that are expected to complement each other (i.e. where one is the reference allele, the other is the alternate allele). For our purposes, we define the core diplotype solving problem as a slight reformulation of the weighted minimum error correction (wMEC) problem as described previously (Patterson et al. 2015). Given a matrix, where each row corresponds to the condensed allele representation of one read in the phase block, the goal is to find two haplotypes, and , such that the cost of changing each row to exactly match either or is minimized. We refer the reader to Patterson et al. (2015) for greater technical details around this and other formulations of the phasing/wMEC problem.
HiPhase uses a version of the A* search algorithm (Hart et al. 1968) to search for the optimal diplotype. In general, A* search algorithms explore a given search space by iteratively expanding the current lowest cost option based on both an observed cost and heuristic cost estimate. In HiPhase, the search space is all possible diplotype solutions modeled as a search tree, and the heuristic is generated by chaining sub-problem solutions. While increased coverage and variant density will increase the run-time of this algorithm, in practice, we find that HiPhase can operate efficiently on the typical 30× human whole genome without down-sampling. Greater details on all HiPhase methods, heuristics, and limitations can be found in the Supplementary Material.
2.4 Data
The following sections outline public resources used in our analysis. Details on and links to each can be found in the Supplementary Material.
2.4.1 Benchmark datasets
For benchmarking phase accuracy, we used publicly available Genome in a Bottle phased variant sets v4.2.1 (GRCh38) for HG001, HG002, and HG005 (Wagner et al. 2022). We note that these files are limited to human autosomes, and do not provide phasing information for structural variants.
2.4.2 HiFi sequencing datasets
We used publicly available whole genome, HiFi sequencing datasets for evaluating HiPhase. Three replicates for the sample HG002 sequenced to ∼30-fold coverage on the Revio system were obtained from https://downloads.pacbcloud.com/public/revio/2022Q4/.
2.4.3 Pre-processing
We followed the recommended practices for pre-processing our HiFi datasets through variant calling. In brief, we aligned reads to GRCh38 using pbmm2 and then performed variant calling with DeepVariant with the PacBio model (Poplin et al. 2018), pbsv, and TRGT (Dolzhenko et al. 2023).
2.5 Phasing metrics
We measured the quantity of phasing errors with switches and flips (Patterson et al. 2015). To measure phase block length, we calculated NG50 using GRCh38 for reference length. We additionally gathered the total number of phased variants, number of phased structural and tandem repeat variants (HiPhase only), and the percentage of fully phased genes. We note that NG50 and number of phased variants are measured using tool-specific outputs. All analyses were performed across the full autosomes. Further details on how all metrics are defined, calculated, and collected are available in the Supplementary Material.
2.6 Tool descriptions
We focus our analyses on comparing the current approach for HiFi read-backed phasing, WhatsHap (Patterson et al. 2015), to our new tool, HiPhase. For WhatsHap, we provide small variants from DeepVariant and instruct the tool to phase both SNVs and indels. For HiPhase, we provide the same DeepVariant VCF file, a structural variant VCF file from pbsv, and a tandem repeat variant VCF file from TRGT. We set HiPhase to use global re-alignment (see Section 2.2), which is our recommended approach for phasing with structural variants. Additional results for other modes of running both WhatsHap and HiPhase are available in the Supplementary Material.
3 Results
On average, HiPhase generated 480 kb phase block NG50, fully phased 93.8% of genes, and phased 3.1 M variants with only 929 switchflip errors (see Fig. 1), a rate of 1 error every 3.3 K variants. Additionally, HiPhase generated over 11 K phased structural variants and 68 K phased tandem repeat variants per replicate, a feature that is not present in WhatsHap. Additional analyses with more metrics and samples are available in the Supplementary Material.

Summary comparison metrics. Each line represents a phase result where the line style corresponds to a HG002 replicate and the color corresponds to the phasing approach (top three blue lines for HiPhase or bottom three red lines for WhatsHap). Each metric axis is oriented such that better results are toward the top (i.e. switches and flips are inverted). In all metrics, results from HiPhase were strictly better than those from WhatsHap
To our knowledge, HiPhase is the first phasing tool to jointly phase SNVs, indels, structural, and tandem repeat variants. Compared to WhatsHap, HiPhase generated longer phase blocks (480 versus 407 kb) with fewer phasing errors (929 versus 1615 switchflips), phased more total variants (3.05 versus 2.94 million), and fully phased more genes (93.8% versus 87.0%). HiPhase also includes usability features, such as innate multi-threading, statistics gathering, and concurrent phased alignment output generation. In future versions of the tool, we will extend HiPhase into other forms of structural variants, such as inversions and copy number variants.
Acknowledgements
The authors thank the anonymous reviewers and early HiPhase testers for their valuable suggestions. They thank Justin Zook for access to an assembly-based draft HG002 benchmark that includes larger structural and tandem repeat variants. They also thank Tim Dunn for updating vcfdist to support analysis of phased structural and tandem repeat variants from the draft benchmark.
Author contributions
All authors conceived HiPhase; J.M.H. and C.T.S. developed HiPhase; J.M.H. and W.J.R. analyzed the results; and J.M.H., C.T.S., W.J.R., and M.E. wrote and reviewed the manuscript.
Supplementary data
Supplementary data are available at Bioinformatics online.
Conflict of interest
All authors are employed by and hold stock in PacBio.
Funding
None declared.
Data availability
HiPhase is released as source code and a pre-compiled binary file on GitHub: https://github.com/PacificBiosciences/HiPhase. Additional instructions and use cases are available in the accompanying documentation. Details and links to all benchmark files, sequencing datasets, tool versions, and commands used in this document as well as additional dataset analyses are available in the Supplementary Material.